Preface


This volume is, in a sense, the culmination of over 20 years of statistical work and over 15 years 
of personal interactions. One of us, Fienberg, was exposed to the ideas of the Grade of Member- 
ship (GoM) model in a workshop on disability forecasting (National Research Council, 1994). The 
GoM work was done in the context of the National Long Term Care Survey, and the modeling 
ideas were likelihood-based and quite opaque to Fienberg at the time. Several years later Fien- 
berg introduced Erosheva to the literature on the GoM model. A Bayesian approach to the GoM 
model, in the context of other latent structure models, became the main focus of Erosheva’s dis- 
sertation research at Carnegie Mellon University (Erosheva, 2002). Independently, a third editor 
of this volume, Blei, began his Ph.D. research with his advisor Michael Jordan at the University 
of California at Berkeley on topic modeling, developing a method referred to as latent Dirichlet 
allocation (Blei et al., 2003), which resembles a Bayesian GoM. Around that time, conversations 
with Tom Minka prompted Erosheva to look deeper into that resemblance. Shortly afterwards, in- 
spired by Matthew Stephens’s pointer to the admixture model from genetics — another model that 
resembles a Bayesian GoM — Erosheva formulated the more general mixed membership framework, 
encompassing all three approaches, and provided ways to construct mixed membership models for 
other data structures (Erosheva, 2003). Blei and Erosheva met only once, for coffee at the 2003 Joint 
Statistical Meetings in San Francisco. 

After completing his dissertation, Blei came to Carnegie Mellon as a postdoctoral fellow and 
began to collaborate on network modeling with Fienberg and Airoldi, and another colleague in the 
Machine Learning Department. This work culminated in the mixed membership stochastic block- 
model (Airoldi et al., 2008), and was also at the heart of Airoldi’s Ph.D. thesis (Airoldi, 2006). In 
light of these and many subsequent interactions and joint work, it was natural for the four of us to 
collaborate on the present volume. 

We have many people to thank for the completion of the present volume. First, we thank the 
authors of the chapters, many of whom are our friends, students, collaborators, and professional 
colleagues. They have contributed first-rate research to the volume, and the following pages are 
evidence of their great efforts and intellect. Second, we thank John Kimmel of Chapman & Hall, 
who gave us constant encouragement. Finally, and most of all, we thank Kira Bokalders. Kira is the 
real editor of this collection. She organized the effort, converted documents from varying formats, 
copy-edited every contribution, constructed the indexes, and guided us through by preparing the 
final camera-ready manuscript. Editors are listed in lexicographical order of their last names. 

Edoardo M. Airoldi, Cambridge, MA 
David M. Blei, New York City, NY 
Elena A. Erosheva, Seattle, WA 
Stephen E. Fienberg, Pittsburgh, PA 


July 4, 2014.

Mixed membership models have emerged over the past 20 years as a flexible cluster-like modeling
tool for unsupervised analyses of high-dimensional multivariate data where the assumption that an
observational unit belongs to a single cluster, or principal component, is violated. Instead, one assumes
that every unit partially belongs to all clusters, according to an individual membership vector.
Mixed membership models were introduced essentially independently in a number of different statistical
application settings: (1) survey data (Berkman et al., 1989; Erosheva, 2002; Erosheva et al.,
2007), (2) population genetics (Pritchard et al., 2000b; Rosenberg et al., 2002), (3) text analysis
(Blei et al., 2003; Erosheva et al., 2004; Airoldi et al., 2010), and then later on in (4) image processing
and annotation (Barnard et al., 2003; Fei-Fei and Perona, 2005), and (5) molecular biology
(Segal et al., 2005; Airoldi et al., 2006; 2007; 2013).


1.1 Historical Developments 

This volume chronicles recent developments in the area of mixed membership modeling. Mixed 
membership models are used to characterize complex multivariate data such as those arising in 
studies of genetic build-up of biological organisms, patterns in disease and disability manifestations,
combinations of topics covered by text documents, political ideology or electorate voting patterns,
or heterogeneous relationships in networks. Early applications of mixed membership modeling in- 
cluded the admixture model in genetics (Pritchard et al., 2000a), the Grade of Membership model 
in medical classification studies (Manton et al., 1994b), and the latent Dirichlet allocation model in 
machine learning (Blei et al., 2003). 

In contrast to the finite mixture or parametric clustering models (McLachlan and Peel, 2000), 
mixed membership models assume that individuals or observational units may only partly belong 
to population mixture categories, referred to in various fields as topics, extreme profiles, pure or
ideal types, states, or subpopulations. The degree of membership then is a vector of continuous 
non-negative latent variables that add up to 1 (in mixture models, membership is a binary indicator). 
The original idea for a mixed membership type of modeling goes back to at least the 1970s when 
the Grade of Membership (GoM) model was developed by mathematician Max Woodbury to allow 
for “fuzzy” classifications in medical diagnosis problems (Woodbury et al., 1978). The model did
not receive much attention from statisticians in the early years, and was later characterized by
seemingly controversial statements regarding the nature of the compositional data implied by the
GoM model (Haberman, 1995). It was not until the early 2000s, with the widespread use of Bayesian
methods and a better explanation of the duality between the discrete and continuous nature of latent
structure in the GoM model, that a new Bayesian approach to the GoM model was developed
(Erosheva, 2003). The almost simultaneous and independent development of the admixture model 
in genetics (Pritchard et al., 2000a) and the latent Dirichlet allocation (LDA) model in computer 
science (Blei et al., 2003) also relied on the use of Bayesian estimation or approximate Bayesian 
estimation techniques, as in the case of LDA. This class of mixed membership models (Erosheva, 
2002) unifies the LDA, GoM, and admixture models in a common framework and provides ways to 
construct other individual-level mixture models by varying assumptions on the population, sampling 
unit and latent variable levels, and the sampling scheme. 

The word mixed in the name mixed membership comes from the alternative latent class specifi- 
cation of the models where each attribute is generated according to its distribution in a certain basis 
category (Erosheva et al., 2007). For example, each word in an article corresponds to a particular 
topic, whereas the article’s composition as a whole corresponds to the author’s intention to cover 
a selection of topics. Thus, the multivariate collection of outcomes for each sampling unit is com- 
posed of a mix of attributes that originate from the basis categories, e.g., words within a document 
that are generated from topics covered by that document. In the case of discrete data, the latent 
topic indicators for each word do not necessarily have to be the latent variables in the model. An 
alternative data-generating process that results in the same likelihood can be based on the latent de- 
grees of membership controlling the proportions of attributes originating from each basis category 
(Erosheva, 2005). For this reason, mixed membership models have been occasionally referred to as 
partial membership models (e.g., Erosheva, 2004); however, that name has not gained widespread 
use and the name mixed membership remains the most commonly used descriptor (Erosheva and 
Fienberg, 2005). 


1.2 A General Formulation for Mixed Membership Models 

The general mixed membership model relies on four levels of assumptions: population, subject, la- 
tent variable, and sampling scheme. Population level assumptions describe the general structure of 
the population that is common to all subjects. Subject level assumptions specify the distribution of 
observed responses given individual membership scores. Membership scores are usually unknown 
and hence can also be viewed as latent variables. The next assumption specifies whether the membership
scores are treated as unknown fixed quantities or as random quantities in the model. Finally,
the last level of assumptions specifies the number of distinct observed characteristics (attributes) and
the number of replications for each characteristic. We describe each set of assumptions formally in 
turn. 

Population Level 

Assume there are K original or basis subpopulations in the populations of interest. For each subpopulation
k, denote by $f(x_j \mid \theta_{kj})$ the probability distribution for response variable j, where $\theta_{kj}$
is a vector of parameters. Assume that within a subpopulation, responses to observed variables are
independent.

Subject Level 

For each subject, membership vector $\lambda = (\lambda_1, \ldots, \lambda_K)$ provides the degrees of a subject’s membership
in each of the subpopulations. The probability distribution of observed responses $x_j$ for
each subject is fully defined by the conditional probability $\Pr(x_j \mid \lambda) = \sum_k \lambda_k f(x_j \mid \theta_{kj})$, and the
assumption that response variables $x_j$ are independent, conditional on membership scores. In addition,
given the membership scores, observed responses from different subjects are independent.

Latent Variable Level 

With respect to the latent variables, one could either assume that they are fixed unknown constants 
or that they are random realizations from some underlying distribution. 

1. If the membership scores $\lambda$ are fixed but unknown, the conditional probability of observing $x_j$,
given the parameters $\theta$ and membership scores, is

$$\Pr(x_j \mid \lambda; \theta) = \sum_{k=1}^{K} \lambda_k f(x_j \mid \theta_{kj}). \tag{1.1}$$

2. If membership scores $\lambda$ are realizations of latent variables from some distribution $D_\alpha$, parameterized
by vector $\alpha$, then the probability of observing $x_j$ given the parameters is:

$$\Pr(x_j \mid \alpha, \theta) = \int \left( \sum_{k=1}^{K} \lambda_k f(x_j \mid \theta_{kj}) \right) dD_\alpha(\lambda). \tag{1.2}$$
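The two latent variable assumptions can be made concrete with a small numerical sketch. It assumes a single categorical item, K = 2 subpopulations, and a Dirichlet distribution for the membership scores; the function names, the parameter layout `theta_j[k][x]`, and the numbers are illustrative choices, not part of the original formulation. Equation (1.1) is evaluated directly, and equation (1.2) is approximated by Monte Carlo averaging over sampled membership vectors.

```python
import random


def response_prob_fixed(x, lam, theta_j):
    """Equation (1.1): sum over k of lam[k] * f(x | theta_kj).

    theta_j[k][x] is the probability of category x for this item in
    subpopulation k (a hypothetical layout chosen for this sketch).
    """
    return sum(l * th[x] for l, th in zip(lam, theta_j))


def sample_dirichlet(alpha, rng):
    """Draw one membership vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def response_prob_random(x, alpha, theta_j, n_draws=20000, seed=0):
    """Monte Carlo estimate of equation (1.2), assuming D_alpha is Dirichlet."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_draws):
        lam = sample_dirichlet(alpha, rng)
        acc += response_prob_fixed(x, lam, theta_j)
    return acc / n_draws


# Two subpopulations and one binary item (illustrative values).
theta_j = [[0.9, 0.1], [0.2, 0.8]]
p_fixed = response_prob_fixed(1, [0.25, 0.75], theta_j)
p_random = response_prob_random(1, [1.0, 1.0], theta_j)
```

For a single response the integrand is linear in the membership scores, so the random-effects probability equals the fixed-effects formula evaluated at the mean membership vector; the two assumptions differ once several responses from the same subject are modeled jointly.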


Sampling Scheme 

Suppose R independent replications of J distinct characteristics are observed for one subject,
$\{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R}$. Then, if the membership scores are treated as realizations from distribution
$D_\alpha$, the conditional probability is

$$\Pr\left( \{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R} \mid \alpha, \theta \right) = \int \left( \prod_{j=1}^{J} \prod_{r=1}^{R} \sum_{k=1}^{K} \lambda_k f(x_j^{(r)} \mid \theta_{kj}) \right) dD_\alpha(\lambda). \tag{1.3}$$

When the latent variables are treated as unknown constants, the conditional probability for observing
R replications of J variables can be derived analogously. In general, the number of observed
characteristics J need not be the same across subjects, and the number of replications R need not
be the same across observed characteristics.

One can obtain a number of mixed membership models using this general set up by specifying
different choices of J and R, and different latent variable assumptions. For instance, the Grade
of Membership model of Manton et al. (1994b) assumes polytomous responses are observed to
J survey questions without replications and uses the fixed-effects assumption for the membership 
scores. Potthoff et al. (2000) employ a variation of the Grade of Membership model by treating 
the membership scores as Dirichlet random variables; the authors refer to the resulting model as a 
Dirichlet generalization of latent class models. In genetics, Pritchard et al. (2000a) use a clustering 
model with admixture, which they labeled as structure. For diploid individuals the clustering model 
assumes that R = 2 replications (genotypes) are observed at J distinct locations (loci), treating the 
proportions of a subject’s genome that originated from each of the basis subpopulations as random 
Dirichlet realizations. Variations of mixed membership models for text documents called proba- 
bilistic latent semantic analysis (Hofmann, 2001) and latent Dirichlet allocation (Blei et al., 2003) 
both assume that a single characteristic (word) is observed a number of times for each document, 
but the former model considers the membership scores as fixed unknown constants, whereas the 
latter treats them as random Dirichlet realizations. 

The mixed membership model framework presented above unifies several specialized models 
that have been developed independently in the social sciences, genetics, and text mining applica- 
tions. 
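The way different choices of J and R produce different models can be sketched with a small simulator. The sketch below assumes categorical responses and Dirichlet membership scores, and it uses the latent class ("switching") representation, in which each response first picks a subpopulation according to the membership vector; the function names and the `theta[k][j]` layout are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a membership vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities (inverse-CDF method)."""
    u, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1  # guard against rounding in the cumulative sum


def simulate_subject(alpha, theta, n_reps, rng):
    """Simulate R replications of J items for one subject.

    theta[k][j] is the category distribution of item j in subpopulation k.
    Each response draws a subpopulation from the membership vector and then
    a category from that subpopulation's distribution, which gives the
    marginal sum_k lambda_k f(x | theta_kj) for every response.
    """
    lam = sample_dirichlet(alpha, rng)
    n_items = len(theta[0])
    x = [[None] * n_reps for _ in range(n_items)]
    for j in range(n_items):
        for r in range(n_reps):
            k = sample_categorical(lam, rng)
            x[j][r] = sample_categorical(theta[k][j], rng)
    return lam, x


rng = random.Random(1)
theta = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],  # subpopulation 1, items 1..3
    [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]],  # subpopulation 2, items 1..3
]
# n_reps=2 with several items mimics the admixture setting (two alleles per
# locus); n_reps=1 mimics survey items without replications; a single item
# with many replications mimics words in a document.
lam, x = simulate_subject([0.5, 0.5], theta, n_reps=2, rng=rng)
```

Only the random-effects case is simulated here; under the fixed-effects assumption the membership vector would be supplied as an unknown constant rather than drawn.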


1.3 Advantages of Mixed Membership Models in Applied Statistics 

Mixed membership models have had a significant impact on applied statistics. Over the past decade, 
the data that statisticians analyze have become more diverse and structured, and with this complexity 
comes the opportunity to model individual data points as belonging to multiple groups. Indeed, for 
many modern datasets — such as large-scale text documents and complex networks — we believe 
that there is rarely a case for the simpler models. Statisticians need mixed membership models or 
alternatives to them, and this is the reason to study them. 

The main areas to which mixed membership models have been applied are reflected in the 
contents of this volume. 

Document Collections 

Mixed membership models are widely applied to document collections (Blei et al., 2003; Blei, 
2012). In document collections, the mixed membership assumptions naturally capture the heterogeneity
of language, where each document exhibits multiple themes to different degrees. When
modeling documents as data, each document is a collection of words from a vocabulary. (These are 
grouped as categorical data.) Mixed membership models allow each document to exhibit multiple 
components, where each component is a distribution over words. Conditioned on a collection, in- 
specting the posterior of the components reveals the “topics” inherent in the documents, i.e., the 
significant patterns of words associated under a single theme. For this reason, mixed membership 
models of text are often called topic models. 

Mixed membership models for text have been extended in a myriad of ways and developed for 
many text-based applications. As examples, they have been developed into time series (Blei and 
Lafferty, 2006), into further hierarchicalized models of word contagion (Doyle and Elkan, 2009), 
into Bayesian nonparametric variants (Teh et al., 2006), and into models of interconnected docu- 
ments (Chang and Blei, 2009). In some ways, mixed membership models of text have become a 
benchmark for new innovations in mixed membership modeling.

Another central application of mixed membership models is for the analysis of network data. A network
consists of a population of units and their relationships, represented via a graph with a set of
nodes and edges between them. Networks arise naturally in sociological settings, co-author analysis, 
and a variety of biological problems. A classical latent-variable model of networks is the stochas- 
tic blockmodel (Wang and Wong, 1987), which assumes that each node belongs to a community, 
and that its assigned community mediates its connection to other nodes. While these assumptions 
may have been appropriate for small-scale network analysis, modern networks are heterogeneous.
Nodes belong to multiple communities, and each node’s connections reflect its particular signature 
of community memberships. This is a natural setting for mixed membership models. 

Airoldi et al. (2008) developed the mixed membership extension of the stochastic blockmodel. 
Each node possesses an associated membership vector containing community proportions; each 
edge (present or absent) is associated with a community assignment drawn from the corresponding 
nodes’ proportions. Note that modeling networks is fundamentally different from modeling docu- 
ments because the observations are by definition intertwined. (We typically assume that documents, 
in contrast, are conditionally independent.) Mixed membership network models remain an active 
area of research. Further innovations include modeling dynamic networks (Ho et al., 2011) and 
including node attributes in modeling (Kim and Leskovec, 2011; Azari and Airoldi, 2012; Azizi 
et al., 2014). More broadly, networks are a type of dyadic data (data with entries indexed by a row
and column) for which we can conceive more general mixed membership models (Mackey et al.,
2010).
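The generative idea behind a mixed membership blockmodel can be sketched as follows. This is a minimal illustration in the spirit of the description above, not the authors' implementation: it assumes Dirichlet community proportions per node, a separate community draw for the sender and the receiver of each directed pair, and a hypothetical block matrix `block[g][h]` giving the probability of an edge between communities g and h.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw community proportions from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities (inverse-CDF method)."""
    u, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1


def simulate_network(n_nodes, alpha, block, rng):
    """Simulate a directed network with mixed community memberships."""
    pi = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    y = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-loops in this sketch
            g = sample_categorical(pi[p], rng)  # community p adopts toward q
            h = sample_categorical(pi[q], rng)  # community q adopts toward p
            y[p][q] = 1 if rng.random() < block[g][h] else 0
    return pi, y


rng = random.Random(7)
block = [[0.9, 0.05], [0.05, 0.8]]  # dense within, sparse between communities
pi, y = simulate_network(6, [0.3, 0.3], block, rng)
```

Because every entry of `y` depends on the membership vectors of two nodes, the observations are intertwined in a way that the conditionally independent documents of a topic model are not.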

Social and Health Sciences Applications 

The earliest mixed membership model, the Grade of Membership (GoM) model, was developed by
the statistician Max Woodbury (Woodbury et al., 1978), in the context of a medical classification 
problem where subsets of symptoms were observed on each patient. The goal was to identify and 
characterize sub-patterns of illness in a particular disease such as depression (Davidson et al., 1989), 
schizophrenia (Manton et al., 1994a), and Alzheimer’s (Corder and Woodbury, 1993). GoM model 
analysis has been applied extensively to disability survey data, to analyze patterns in binary indicators
of basic and instrumental activities of daily living, in a frequentist (Berkman et al., 1989;
Manton et al., 1991) and Bayesian framework (Erosheva et al., 2007). Mixed membership method- 
ologies have been extended to longitudinal settings to capture heterogeneous pathways of disability 
and cognitive trajectories at the later portion of life (this volume: Manrique-Vallier, 2014 and Lecci, 
2014). In political science, researchers have used mixed membership models to analyze politically- 
oriented beliefs, values, and attitudes from survey data (this volume: Gross and Manrique-Vallier, 
2014) and have developed mixed membership models for rank data to analyze votes in Irish elec- 
tions (Gormley and Murphy, 2009). Other applications of mixed membership models include as- 
sessing the risk of privacy violations in databases (Manrique-Vallier and Reiter, 2012), and even 
reconstructing the contents of a city based on sparse archeological evidence (Mimno, 2011).

Population Genetics 

In computational biology, mixed membership models have had a tremendous impact, most notably 
following the structure model of Pritchard et al. (2000a). In this setting, we observe a collection of 
human genomes in which each is a collection of alleles (A,G,C,T) measured at different locations. 
The model assumes that there are ancestral populations, groups of original humans that share a 
unique genetic signature, which migrated around the world and mixed. The observed genomes — the 
data we are analyzing — reflect the results of that mixing. Each genome exhibits the populations with 
different proportions, and each population is characterized by its allele probabilities across genome
locations. Posterior inference of the proportions and populations reveals the latent genetic structure
of modern humans. 

This kind of analysis has been used in two ways. First, as for networks and text, it is useful for 
exploring genetic patterns and forming hypotheses about our genetic history. Second, it is important 
for correcting analyses that seek to find associations between genes and traits. Patterns in ancestral 
populations, though not observed, are a confounder to making such associations; inferences from 
mixed membership models are useful in accounting for them. In this volume Shringarpure and Xing 
(2014) discuss some interesting variants on the original Pritchard et al. (2000a). 


1.4 Theoretical Issues with Mixed Membership Models 

The early examples of original mixed membership models described above were developed for dis- 
crete data, involving multivariate binary data, multinomial data, and ranks, and researchers using 
them considered responses to survey questions, counts of words in a document, sequences of geno- 
types, presence or absence of interactions between units, etc. Even though the general formulation 
of mixed membership models allows for combining outcomes of different types in a single omnibus 
model (Erosheva, 2002), the understanding of the theoretical properties of mixed membership models
applied to continuous data and data of mixed outcomes, and the applications of mixed membership to
such problems, are quite limited; see, e.g., the discussion in Heller et al. (2008) and the analysis
of gene expression data by Rogers et al. (2005).

Extending mixed membership models to continuous data and data of mixed types is nontrivial.
In this volume Galyardt (2014) demonstrates that the two interpretations, mixed attributes (the
‘switching’ interpretation) and partial memberships (the ‘between’ interpretation), which are typically
assumed to be equivalent interpretations of mixed membership models, cannot be treated as
equivalent in the presence of continuous data. In fact, the ‘between’ interpretation no longer applies.
Gruhl and Erosheva (2014) consider a broader class of individual-level mixture models and com- 
pare two members of this class — the mixed membership and the partial membership model (Heller 
et al., 2008) — for analyzing continuously-valued data. In essence, given individual-specific weights 
reflecting membership, mixed membership models assume that data are generated from individual- 
specific distributions that are weighted arithmetic averages of the subpopulation distributions, and 
partial membership models assume that individual-specific data are generated from a weighted geometric
average of the subpopulation distributions. They explain that multivariate data may not provide
researchers with a clear signal about the preferred type of individual-level mixture model. However, 
in this volume, analyzing a player statistics dataset from the National Basketball Association, Gruhl 
and Erosheva (2014) argue that the use of partial membership in that specific context is more ap- 
propriate. Partial membership models also happen to be more computationally convenient. Galyardt 
(2014) and Gruhl and Erosheva (2014) raise a number of issues for future work with individual- 
level mixture models for continuous data; some of these issues bear a clear connection to the large 
body of statistical literature on mixture models in general and on mixtures of normals in particular 
(McLachlan and Peel, 2000). 

1.4.1 General Issues Inherent to Mixtures 

While applications for mixed membership models especially in the form of extensions of topic 
models for text are widespread, these models suffer from a number of theoretical difficulties they 
inherit from mixture models. A lack of understanding of such issues may impact the validity of 
empirical analyses based on mixed membership models. Below we list a few key issues, borrowing 
material from a blog post on the topic by Wasserman (2012).

These issues are best illustrated in the context of a simple mixture model. Consider a finite
mixture of Gaussians,

$$p(x; \psi) = \sum_{j=1}^{k} w_j \, \phi(x; \mu_j, \Sigma_j),$$

where $\phi(x; \mu_j, \Sigma_j)$ denotes a Gaussian density with mean vector $\mu_j$ and covariance matrix $\Sigma_j$.
The weights $w_1, \ldots, w_k$ are non-negative and sum to 1. The entire set of parameters is $\psi =
(\mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k, w_1, \ldots, w_k)$. One can also consider k, the number of components, to be
another parameter.

Now let's consider some of the weird things that can happen.

Infinite Likelihood. The likelihood function (for the Gaussian mixture) is infinite at some points 
in the parameter space. This is not necessarily bad, since the infinities are at the boundary and one 
can use the largest (finite) maximum in the interior as an estimator. But the infinities can cause 
numerical problems. 
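A short numerical sketch shows where the infinities come from. It assumes a univariate two-component mixture with equal weights, a second component fixed at N(0, 1), and an arbitrary five-point dataset; all of these are illustrative choices. Centering the first component on one observation and shrinking its standard deviation makes the log-likelihood grow without bound.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, mu1, sigma1, mu2=0.0, sigma2=1.0, w=0.5):
    """Log-likelihood of a two-component univariate Gaussian mixture."""
    return sum(
        math.log(w * normal_pdf(x, mu1, sigma1) + (1.0 - w) * normal_pdf(x, mu2, sigma2))
        for x in data
    )


data = [-1.2, -0.4, 0.3, 0.9, 1.7]  # arbitrary illustrative sample

# Put the first component exactly on the first observation and let its
# standard deviation shrink toward zero.
sigmas = (1.0, 1e-2, 1e-4, 1e-8)
logliks = [mixture_loglik(data, mu1=data[0], sigma1=s) for s in sigmas]
```

The second component keeps every other observation's contribution finite, so the spike at the first observation dominates and the sequence in `logliks` increases as the standard deviation decreases.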

Multi-modality of the Likelihood. In fact, the likelihood has many modes (Richards and Buot, 
2006). Finding the global (but not infinite) mode is difficult. The EM algorithm only finds local
modes. In this sense, the MLE is not really a well-defined estimator because it cannot be found. In 
the machine learning literature, there have been a number of papers trying to establish estimators 
for mixture models that can be found in polynomial time. For example, see Kalai et al. (2012). 

Multi-modality of the Density. One may naively think that a mixture of k Gaussians would have 
k modes. But, in fact, it can have fewer than k or more than k. See Carreira-Perpiñán and Williams
(2003) and Edelsbrunner, Fasy, and Rote (2012).
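The "fewer than k" case is easy to see in one dimension. The sketch below assumes an equal-weight mixture of two unit-variance Gaussians and counts local maxima of the density on a grid; the helper names and grid settings are illustrative. It does not attempt the "more than k" case, which does not arise for univariate Gaussian mixtures of this kind.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def equal_mixture(mu1, mu2):
    """Equal-weight mixture of N(mu1, 1) and N(mu2, 1)."""
    return lambda x: 0.5 * normal_pdf(x, mu1, 1.0) + 0.5 * normal_pdf(x, mu2, 1.0)


def count_modes(density, lo=-8.0, hi=8.0, n=4001):
    """Count local maxima of a density evaluated on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [density(x) for x in xs]
    return sum(
        1 for i in range(1, n - 1) if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]
    )


modes_far = count_modes(equal_mixture(-2.0, 2.0))   # well-separated means
modes_near = count_modes(equal_mixture(-0.5, 0.5))  # overlapping components
```

With means at -2 and 2 the density has two modes, while with means at -0.5 and 0.5 the two components merge into a single mode at zero.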

Non-identifiability. Recall that a model $\{p(x; \theta) : \theta \in \Theta\}$ is identifiable if
$\theta_1 \neq \theta_2$ implies $p(x; \theta_1) \neq p(x; \theta_2)$.

Mixture models are non-identifiable in two different ways. First, there is non-identifiability due to 
permutation of labels. This is a nuisance, and there are strategies to deal with it (Stephens, 2000). A 
bigger issue is local non-identifiability. Suppose that 

$$p(x; \eta, \mu_1, \mu_2) = (1 - \eta)\,\phi(x; \mu_1, 1) + \eta\,\phi(x; \mu_2, 1).$$

When $\mu_1 = \mu_2 = \mu$, we have that $p(x; \eta, \mu_1, \mu_2) = \phi(x; \mu, 1)$. The parameter $\eta$ has disappeared. Similarly,
when $\eta = 1$, the parameter $\mu_1$ disappears. This means that there are subspaces of the parameter
space where the family is not identifiable. The result is that all the usual theory about the
distribution of the MLE, the distribution of the likelihood ratio statistic, the properties of BIC, and 
so on, becomes very complicated. 
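Both kinds of non-identifiability can be checked numerically. The sketch below assumes the two-component unit-variance mixture with mixing weight eta on the second component; the function names and the particular parameter values are illustrative. It compares densities on a grid for parameter settings that differ but should produce the same distribution.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def two_component_density(x, eta, mu1, mu2):
    """(1 - eta) * N(mu1, 1) + eta * N(mu2, 1) evaluated at x."""
    return (1.0 - eta) * normal_pdf(x, mu1, 1.0) + eta * normal_pdf(x, mu2, 1.0)


grid = [i / 10.0 for i in range(-50, 51)]

# Local non-identifiability: with equal means the weight eta has no effect.
collapse_gap = max(
    abs(two_component_density(x, 0.2, 1.0, 1.0) - two_component_density(x, 0.8, 1.0, 1.0))
    for x in grid
)

# Label permutation: swapping the components and the weights changes nothing.
swap_gap = max(
    abs(two_component_density(x, 0.3, -1.0, 2.0) - two_component_density(x, 0.7, 2.0, -1.0))
    for x in grid
)
```

Both gaps are zero up to floating-point rounding, so no amount of data can distinguish the parameter settings being compared.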

Irregularity. Mixture models do not satisfy the usual regularity conditions that make parametric 
models easy to deal with. Consider the following example from Chen (1995). Let 

$$p(x; \theta) = \tfrac{2}{3}\,\phi(x; -\theta, 1) + \tfrac{1}{3}\,\phi(x; 2\theta, 1).$$

Then $I(0) = 0$, where $I(\theta)$ is the Fisher information. Moreover, no estimator of $\theta$ can converge faster
than $n^{-1/4}$. Compare this to a Normal family $\phi(x; \theta, 1)$, where the Fisher information is $I(\theta) = n$
and the maximum likelihood estimator converges at rate $n^{-1/2}$.
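The vanishing Fisher information can be checked numerically. The sketch below assumes the mixture weights 2/3 and 1/3 on components centered at minus theta and two theta, computes the per-observation information as the integral of the squared derivative of the density divided by the density, and approximates the derivative by a central difference; the step size, integration range, and function names are illustrative choices.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def chen_density(x, theta):
    """Mixture (2/3) N(-theta, 1) + (1/3) N(2 theta, 1)."""
    return (2.0 / 3.0) * normal_pdf(x, -theta, 1.0) + (1.0 / 3.0) * normal_pdf(x, 2.0 * theta, 1.0)


def normal_family(x, theta):
    """Regular comparison family N(theta, 1)."""
    return normal_pdf(x, theta, 1.0)


def fisher_information(density, theta, h=1e-3, lo=-12.0, hi=12.0, n=6000):
    """Per-observation Fisher information via the trapezoid rule."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        p = density(x, theta)
        dp = (density(x, theta + h) - density(x, theta - h)) / (2.0 * h)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * dp * dp / p
    return total * step


info_chen_zero = fisher_information(chen_density, 0.0)  # essentially zero
info_chen_half = fisher_information(chen_density, 0.5)  # clearly positive
info_normal = fisher_information(normal_family, 0.0)    # close to one
```

At theta = 0 the first-order effects of the two components cancel exactly, because (2/3)(-theta) + (1/3)(2 theta) = 0, which is why the information vanishes there while it stays positive away from zero. The value for the Normal family is per observation; for n observations it is n, as in the text.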

Non-intuitive Group Membership. Mixtures are often used for finding clusters. Suppose that

$$p(x) = (1 - \eta)\,\phi(x; \mu_1, \sigma_1^2) + \eta\,\phi(x; \mu_2, \sigma_2^2)$$

with $\mu_1 < \mu_2$. Let $Z = 1, 2$ denote the two components. We can compute $P(Z = 1 \mid X = x)$
and $P(Z = 2 \mid X = x)$ explicitly. We can then assign an x to the first component if
$P(Z = 1 \mid X = x) > P(Z = 2 \mid X = x)$. It is easy to check that, with certain choices of $\sigma_1, \sigma_2$, all
large values of x get assigned to component 1 (i.e., the leftmost component). Technically this is
correct, yet it seems to be an unintended consequence of the model.
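One such choice is a wide left component and a narrow right component. The sketch below assumes equal weights, a first component N(0, 3^2), and a second component N(1, 1); these values and the function names are illustrative. It computes the posterior probability of the first component by Bayes' rule and assigns each point to the more probable component.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_component1(x, eta, mu1, sigma1, mu2, sigma2):
    """P(Z = 1 | X = x) for the mixture (1 - eta) N(mu1, s1^2) + eta N(mu2, s2^2)."""
    a = (1.0 - eta) * normal_pdf(x, mu1, sigma1)
    b = eta * normal_pdf(x, mu2, sigma2)
    return a / (a + b)


def assign(x, eta=0.5, mu1=0.0, sigma1=3.0, mu2=1.0, sigma2=1.0):
    """Return 1 or 2, whichever component has the larger posterior probability."""
    return 1 if posterior_component1(x, eta, mu1, sigma1, mu2, sigma2) > 0.5 else 2


labels = {x: assign(x) for x in (-10.0, 1.0, 5.0, 10.0, 50.0)}
```

Points near the second component's mean are assigned to it, but every sufficiently large x goes to the leftmost component because its heavier tail eventually dominates.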

Improper Posteriors. Suppose we have a sample from the simple mixture 

$$p(x; \mu) = \tfrac{1}{2}\,\phi(x; 0, 1) + \tfrac{1}{2}\,\phi(x; \mu, 1).$$

Then any improper prior on $\mu$ yields an improper posterior for $\mu$ regardless of how large the sample
size is. Also, Wasserman (2012) shows that the only priors that yield posteriors in close agreement
with frequentist methods are data-dependent priors.

These issues are often exacerbated in more complex mixed membership models. They should 
be taken seriously. In most applications, however, available additional information can be used to 
mitigate, and sometimes resolve, the problems listed above. The papers we have collected in this 
volume provide good examples, and explain why we do not share Wasserman’s negative assessment: 
“that mixtures, like tequila, are inherently evil and should be avoided at all costs.”

Mixed membership models such as the Grade of Membership and latent Dirichlet allocation models
have primarily focused on the analysis of binary and categorical data. In this chapter, we will
focus on exploring the performance of two different types of membership models with continu- 
ous data: one that has a classic mixed membership structure and one that has a partial membership 
structure. The Bayesian partial membership model was recently proposed by Heller et al. (2008) as 
a promising alternative to mixed membership motivated by continuous data. The Bayesian partial 
membership model based on exponential family distributions allows for computationally efficient 
modeling of a variety of data types. Heller et al. (2008) demonstrated a partial membership analysis 
of a discrete dataset. In this work, we use a dataset that has a collection of continuous variables 
describing NBA (National Basketball Association) players and their playing styles as a motivating 
example. Although NBA players are typically assigned to one of five player positions, the language 
used to describe players and playing styles is often suggestive of individual-level mixtures. In this 
chapter, we compare the exponential family form of the Bayesian partial membership model with 
the general mixed membership model on simulated binary and continuous data. We then extend the 
partial membership framework to account for correlated membership scores. Based on the properties
of the two types of models and the nature of the NBA data, we argue for choosing a partial
membership model over a mixed membership model in this case. We show how the NBA players 
can be modeled as individual-level mixtures using the correlated partial membership model. To our 
knowledge, this is the first individual-level mixture analysis of continuous data. 


2.1 Introduction 

Mixture models provide a model-based approach to clustering. Population-level mixture models de- 
scribe a population as a collection of subpopulations where each individual (or observational unit) 
belongs exclusively to one of the subpopulations (Lazarsfeld and Neil, 1968). Individual-level mix- 
ture models, on the other hand, allow each individual to belong to multiple subpopulations at once, 
with varying degrees of membership among individuals (Woodbury et al., 1978; Pritchard et al., 
2000; Blei et al., 2003; Erosheva, 2002). Because the instance of individuals belonging exclusively 
to one subpopulation is a special case of individuals belonging simultaneously to multiple subpopu- 
lations, individual-level mixture models can be viewed as a relaxation of population-level mixtures 
such as finite mixture or latent class models. 

The family of mixed membership models constitutes the predominant means of employing 
individual-level mixture models. At a high level, the mixed membership model assumes that data 
arise from individual-specific distributions that are arithmetic averages of the subpopulation distri- 
butions with individual-specific weights. Heller et al. (2008) formulated an alternative structure for 
individual-level mixtures, the Bayesian partial membership model, where the data can be viewed 
as arising from a (normalized) weighted geometric average of the subpopulation distributions with 
individual-specific weights. 
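The difference between the two averaging schemes can be illustrated with Gaussian subpopulations. The sketch below assumes two univariate Gaussian subpopulations with a common variance and membership weights that sum to one; under these assumptions the normalized weighted geometric average is itself Gaussian with mean equal to the weighted average of the subpopulation means, whereas the arithmetic average is a two-component mixture. Function names and numerical settings are illustrative.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def mixed_density(x, lam, mus, sigma):
    """Arithmetic average: sum_k lam_k * N(x; mu_k, sigma^2)."""
    return sum(l * math.exp(normal_logpdf(x, m, sigma)) for l, m in zip(lam, mus))


def partial_density_on_grid(grid, lam, mus, sigma):
    """Geometric average prod_k N(x; mu_k, sigma^2)^lam_k, normalized numerically."""
    unnorm = [
        math.exp(sum(l * normal_logpdf(x, m, sigma) for l, m in zip(lam, mus)))
        for x in grid
    ]
    step = grid[1] - grid[0]
    area = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))  # trapezoid rule
    return [u / area for u in unnorm]


lam, mus, sigma = [0.3, 0.7], [-3.0, 3.0], 1.0
grid = [-12.0 + 0.005 * i for i in range(4801)]

partial = partial_density_on_grid(grid, lam, mus, sigma)
weighted_mean = sum(l * m for l, m in zip(lam, mus))
closed_form = [math.exp(normal_logpdf(x, weighted_mean, sigma)) for x in grid]
max_gap = max(abs(p - c) for p, c in zip(partial, closed_form))

# The mixed membership density puts little mass between the two means.
mixed_at_mean = mixed_density(weighted_mean, lam, mus, sigma)
```

In this setting a partial membership individual produces data centered between the subpopulation means, while a mixed membership individual produces data near one mean or the other, which is the behavioral difference the comparison in this chapter turns on.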

When the subpopulation distributions are of exponential family form, the partial membership 
model allows for computationally efficient, individual-level mixture modeling of a variety of data 
types. In this chapter, we concentrate on the exponential family form of the partial membership 
model and compare this model to corresponding mixed membership models for the binary data case 
and the continuous data case. We highlight the differences in the data-generating behavior between
the two types of models, which connect to the work of Galyardt (2014).

To demonstrate an individual-level mixture model analysis with continuous data, we use National
Basketball Association (NBA) player statistics from the 2010-11 season (Hoopdata, 2012). The
case of continuous data is of particular interest as existing individual-level mixture models have 
given less attention to continuous data. Even though the general class of mixed membership models 
(Erosheva, 2002) can accommodate any type of outcome (discrete or continuous), or even a mix of 
different types of outcomes in a model, the early independent developments have been motivated 
by discrete data problems whether in genetics, medicine, or computer science. Likewise, existing 
applications of mixed membership models primarily focus on binary, multinomial, and rank data: 
medical classification based on observed symptoms (Woodbury et al., 1978), counts of words in 
documents (Blei et al., 2003), responses to binary or multiple choice survey items such as disability 
manifestations (Erosheva et al., 2007), voter rankings of political candidates (Gormley and Murphy, 
2009), counts of features present in an image (Wang et al., 2009), presence or absence of interactions 
between units (Airoldi et al., 2008), etc. A mixed membership analysis of continuous gene expres- 
sion data with the latent process decomposition model by Rogers et al. (2005) is an exception. One 
reason for the continued focus of mixed membership models on discrete data is that little is known 
about this type of modeling for continuous data, and that basic examples of mixed membership for 
continuous data do not seem realistic. 

The rest of the chapter is organized as follows. We provide more background on the NBA player 
data in Section 2.2. Section 2.3 provides a review of the mixed membership model and the partial 
membership model (no implied connection to the partial membership model in Erosheva, 2004).
We compare the models and their applications to binary and continuous data in Section 2.4. Our 
comparison of these models for the continuous data case helps explicate our decision to use the 
partial membership for the NBA data analysis. In Section 2.5, we introduce an extension of the 
partial membership model that allows us to accommodate correlations among class membership 
scores. In Section 2.6, we analyze NBA playing styles using the correlated partial membership 
model. 


2.2 Compositional Playing Styles of NBA Players 

In the New York Times basketball blog Off The Dribble, Joshua Brustein highlighted NBA-related
research presented at the 2012 MIT Sloan Sports Analytics Conference (Brustein, 2012). Team 
chemistry and construction were recurring themes in the research with the intent of understanding 
how team chemistry and construction might relate to winning. In understanding the team construc- 
tion process, comparing it across teams, and ultimately relating it to game outcomes, it is helpful to 
be able to group players by playing style and/or ability. 

Typically, basketball players are assigned to one of five positions: point guard (PG), shooting 
guard (SG), small forward (SF), power forward (PF), and center (C). Some players may play multi- 
ple positions. For instance, some players may play both the point guard and shooting guard positions 
or both the small forward and power forward positions. NBA observers may commonly use a more 
informal typology of players with three categories that consolidate the above positions by physi- 
cal attributes and function on the court: point guard, wings (shooting guards and small forwards), 
and bigs (power forwards and centers). However, current positions and player assignments to those 
positions may not fully reflect the variety of playing styles (Lutz, 2012). To classify players based
on their playing style as reflected in their statistics, Lutz (2012) carried out a model-based cluster
analysis of players based on their season statistics. We would like to take a different approach and,
rather than assign players to strictly one playing style or identify clusters that are themselves mix- 
tures of more pure clusters, assume that players themselves demonstrate compositions of different 
pure playing styles. This assumption is intuitively plausible. For instance, the term “combo guard” 
is regularly used to describe a player who combines the skills and the playing style of a typical point 
guard and a typical shooting guard. As a result, we would like to use an individual-level mixture 
model for our analysis of the NBA data. 

To characterize players, we consider 13 different statistics from the 2010-11 NBA season avail- 
able on hoopdata.com (Hoopdata, 2012). Our dataset is composed of 332 players who had played 
30 or more games and averaged 10 or more minutes per game. We selected 13 statistics that char- 
acterize different elements of players’ styles; these largely overlap with the statistics used by Lutz 
(2012) in a model-based cluster analysis of similar data. 

The variables in our dataset include: minutes played per game, percent of made field goals 
that are assisted, assist rate, turnover rate, offensive rebound rate, defensive rebound rate, steals 
per 40 minutes, blocks per 40 minutes, and number of shots attempted per 40 minutes at each of 
the following locations: at the rim, from 3-9 feet, from 10-15 feet, from 16-23 feet, and beyond 
the 3-point line. All of the variables are continuous, but some, such as minutes played per game 
(maximum of 48) or percent of made field goals that are assisted (0-100), are restricted in their range.

In addition to these variables, Lutz (2012) also included the number of games played as another
statistic in the cluster analysis. We elected not to use this variable as it is likely to be influenced by
events such as injuries that may have little connection to a player’s style. Table 2.1 lists the variables,
their abbreviations, and formulas of calculated statistics in our dataset.

TABLE 2.1

Variables, abbreviations, and formulas (if calculated).

Variable   Description and Formula
Min        Minutes played per game
% Ast      Percent of made field goals that are assisted:
           (Assisted field goals) / (Total made field goals)
AR         Assist Ratio:
           (Assists x 100) / (FGA + (FTA x .44) + Turnovers)
TOR        Turnover Ratio:
           (Turnovers x 100) / (FGA + (FTA x .44) + Turnovers)
ORR        Offensive Rebound Rate:
           100 x (Player ORebs x (Team Min/5)) / (Player Min x (Team ORebs + Opp DRebs))
DRR        Defensive Rebound Rate:
           100 x (Player DRebs x (Team Min/5)) / (Player Min x (Team DRebs + Opp ORebs))
Rim        Attempted field goals at the rim per 40 minutes
Close      Attempted field goals from 3-9 feet per 40 minutes
Medium     Attempted field goals from 10-15 feet per 40 minutes
Long       Attempted field goals from 16-23 feet per 40 minutes
3s         3-point field goals attempted per 40 minutes
Stls       Steals per 40 minutes
Blks       Blocks per 40 minutes

Figures 2.1 and 2.2 display two bivariate scatterplots for selected player statistics. The data patterns
presented in Figures 2.1 and 2.2 are typical of other bivariate scatterplots in this dataset (not
shown). The shapes of the plotted points indicate the player’s position. In the list of positions in the 
legend, the positions ‘G’, ‘GF,’ and ‘F’ are listed in addition to the five main positions listed earlier. 
Hoopdata.com uses these designations in their positional assignments to describe players who reg- 
ularly play multiple positions. G (guard) is typically used to describe a player who plays both point 
guard and shooting guard, GF (guard-forward) to describe a player who plays both shooting guard 
and small forward, and F (forward) to describe a player who plays both small forward and power 
forward. 

Figure 2.1 plots the assist and turnover ratios of the players. The data appear to fan out from 
the lower left corner, adopting an almost triangular shape. Within this shape, we can see some 
patterns. Players designated as point guards and guards dominate the points in the upper right. 
Players manning the forward, power forward, and center positions generally appear to have low 
assist ratios and span the range of turnover ratios, comprising the points lining the left side of the 
plot. 

Figure 2.2 presents the corresponding plot for defensive rebound rate and 3-point field goals 
attempted per 40 minutes. We see a different pattern in these data, with a clear cluster of players
made up of forwards, power forwards, and centers who rarely attempt 3-point field goals. Separate
from this cluster is a cloudlike structure of points for players who tend to shoot some 3-pointers. Within the
cloud, we see that point guards tend to have lower defensive rebound rates while forwards, power 
forwards, and centers tend to have higher rebound rates. 

Our first step with individual-level mixture modeling for these data is to identify which compo- 
sitional representation of continuous data is better suited for analyzing NBA player statistics. Next, 
we will present formulations of mixed membership and partial membership models and examine 
the data-generative capabilities of these models for both discrete and continuous data. 
FIGURE 2.1

Bivariate scatterplot of players’ assist ratio and turnover ratio. The symbols of the points represent 
the different positions of each player. 


2.3 Two Types of Membership 

In this section, we introduce the mixed membership and the partial membership models. For each 
of these two individual-level mixture models, we first consider a standard population-level mix- 
ture model formulation and then present each individual-level mixture model as a relaxation of 
the population-level mixture. Heller et al. (2008) used Bayesian methods to estimate the partial 
membership model. Similarly, Bayesian methods are frequently employed with mixed membership 
models. As we will see, the hierarchical Bayesian representations of the two models have many 
features in common. 

2.3.1 Mixed Membership Model 

Let $y_i$ be a vector of $p$ outcomes for the $i$th individual or observational unit. We use $K$ to denote the
number of pure types or mixture components. Let $p_k(\cdot)$ specify the density particular to pure type
$k$, and let $\theta_k$ represent the parameters characterizing $p_k(\cdot)$ for pure type $k$. The population-level
mixture model with $K$ components assumes the existence of $K$ membership indicator variables,
$\pi_{ik}$, for each individual $i$ that designate the cluster or pure type to which the individual belongs.





Handbook of Mixed Membership Models and Its Applications 


[Figure 2.2 plot: x-axis, Defensive Rebound Rate; y-axis, 3-point field goals attempted per 40 minutes; legend (Pos): PG, G/SG/GF, SF/F, PF/C.]

FIGURE 2.2 

Bivariate scatterplot of players’ defensive rebound rate and 3-point field goals attempted per 40 
minutes. The symbols of the points represent the different positions of each player. 


As such, $\pi_{ik} \in \{0, 1\}$ with the restriction $\sum_k \pi_{ik} = 1$. The probability density for $y_i$, given a
collection of parameters $\Theta = (\theta_1, \ldots, \theta_K)$ for all $K$ pure types and given the latent pure type
membership indicators $\pi_{ik}$ for pure type $k$ and individual $i$, is

$$p(y_i \mid \Theta, \pi_i) = \sum_{k=1}^{K} \pi_{ik}\, p_k(y_i \mid \theta_k). \tag{2.1}$$

For the mixed membership model, one replaces $\pi_{ik}$ with a membership score $g_{ik}$. Instead of
being restricted to either 0 or 1, the membership score $g_{ik}$ is allowed to range continuously between
0 and 1, subject to the constraint $\sum_k g_{ik} = 1$. The mixed membership model then takes the form

$$p(y_i \mid \Theta, g_i) = \prod_{j=1}^{J} \sum_{k=1}^{K} g_{ik}\, p_{jk}(y_{ij} \mid \theta_{jk}). \tag{2.2}$$

Here, conditional on the membership vector $g_i = (g_{i1}, \ldots, g_{iK})$, the observations $y_{ij}$ are assumed
to be independent.

In the Bayesian representation of the model introduced, $g_i \sim D_g(\alpha, \rho)$, where $D_g$ is a prior suitable
for compositional parameters and $\alpha, \rho$ are hyperparameters. As $g_i$ lies in the $(K - 1)$-dimensional probability
simplex, the most common choices for $D_g$ are the Dirichlet (Blei et al., 2003) and logistic normal
(Blei and Lafferty, 2007) distributions. For the class-specific and outcome-specific parameters, $\theta_{jk}$,
a conjugate prior is typically assumed:

$$\theta_{jk} \sim \mathrm{Conj}(\lambda, \nu), \tag{2.3}$$

where $\lambda, \nu$ are hyperparameters. The mixed membership model has a latent class representation
that suggests a data augmentation approach for estimation (Erosheva, 2003). This approach adds an 
additional level of hierarchy to the model by including latent classification variables. 
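
As a concrete illustration of Equation (2.2) and of the latent class representation, the following minimal sketch evaluates the mixed membership density and draws one individual by first sampling a latent pure type for each outcome. It assumes normal pure type densities, which is a choice made here for illustration only (the model itself allows any outcome type), and the function names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixed_membership_density(y, g, means, variances):
    """Equation (2.2): product over outcomes j of the g-weighted sum over
    pure types k of p_jk(y[j] | theta_jk), with normal p_jk (an assumption).

    y: list of J outcomes; g: K membership scores summing to 1;
    means[j][k], variances[j][k]: parameters for outcome j and pure type k.
    """
    density = 1.0
    for j, y_j in enumerate(y):
        density *= sum(
            g_k * normal_pdf(y_j, means[j][k], variances[j][k])
            for k, g_k in enumerate(g)
        )
    return density


def sample_mixed_membership(g, means, variances, rng):
    """Latent class representation: each outcome j gets its own latent pure
    type z_j drawn with probabilities g, then y_j is drawn from that type."""
    y = []
    for j in range(len(means)):
        z_j = rng.choices(range(len(g)), weights=g)[0]
        y.append(rng.gauss(means[j][z_j], math.sqrt(variances[j][z_j])))
    return y
```

When the membership vector places all of its weight on one pure type, the density reduces to the population-level mixture component for that type, which is the sense in which the individual-level model relaxes Equation (2.1).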

2.3.2 Partial Membership Model 

An alternative means of specifying Equation (2.1) is through the product of the densities:

$$p(y_i \mid \Theta, \pi_i) = \prod_{k=1}^{K} p_k(y_i \mid \theta_k)^{\pi_{ik}}. \tag{2.4}$$

We specify the partial membership model by relaxing Equation (2.4) so that

$$p(y_i \mid \Theta, g_i) = \frac{1}{c} \prod_{k=1}^{K} p_k(y_i \mid \theta_k)^{g_{ik}}, \tag{2.5}$$

where $g_{ik} \in [0, 1]$ and $c$ is a normalizing constant. Heller et al. (2008) further highlight the case
where $p_k$ is an exponential family density (denoted $\mathrm{Exp}(\cdot)$):

$$p_k(y_i \mid \psi_k) = \mathrm{Exp}(\psi_k). \tag{2.6}$$

Here, $\psi_k$ denotes the natural parameters for pure type $k$. Let $\Psi$ denote the collection of the natural
parameters for all pure types.

Substituting exponential family densities for $p_k$ in Equation (2.5), we obtain

$$p(y_i \mid \Psi, g_i) = \mathrm{Exp}\Big(\sum_k g_{ik} \psi_k\Big). \tag{2.7}$$

In addition, let the natural parameters for each pure type follow a conjugate prior distribution

$$\psi_k \sim \mathrm{Conj}(\lambda, \nu), \tag{2.8}$$

where $\lambda, \nu$ are hyperparameters. As with the mixed membership model, we assume $g_i \sim D_g(\alpha, \rho)$,
where $D_g$ is a prior suitable for compositional parameters and $\alpha, \rho$ are hyperparameters.

Conditional on the membership scores, $y_i$ is distributed according to the same exponential family
distribution as the pure types but with natural parameters that are a convex combination of the
natural parameters of the pure type distributions. The use of the exponential family distributions 
allows one to model a variety of outcome types. Going forward, we focus on this particular case of 
the Bayesian partial membership model. 
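
The convex-combination property can be checked numerically. The sketch below assumes univariate normal pure types (an illustrative choice, with hypothetical function names) and uses the fact that N(mean, var) has natural parameters (mean / var, -1 / (2 var)). It compares the unnormalized product in Equation (2.5) with the normal density implied by the combined natural parameters.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def partial_membership_normal(g, means, variances):
    """Mean and variance of the normal whose natural parameters are the
    g-weighted convex combination of the pure type natural parameters."""
    eta1 = sum(g_k * m / v for g_k, m, v in zip(g, means, variances))
    eta2 = sum(g_k * (-0.5 / v) for g_k, v in zip(g, variances))
    var = -0.5 / eta2
    return eta1 * var, var


def unnormalized_product(y, g, means, variances):
    """Product of pure type densities raised to the membership scores,
    that is, Equation (2.5) without the normalizing constant c."""
    return math.prod(
        normal_pdf(y, m, v) ** g_k for g_k, m, v in zip(g, means, variances)
    )


# Illustrative values: three pure types with different means and variances.
g = [0.2, 0.5, 0.3]
means, variances = [10.0, 25.0, 40.0], [4.0, 9.0, 16.0]
mean, var = partial_membership_normal(g, means, variances)
ratios = [
    unnormalized_product(y, g, means, variances) / normal_pdf(y, mean, var)
    for y in (15.0, 20.0, 30.0)
]
```

The entries of `ratios` agree up to floating-point error, so the product differs from the combined normal density only by a constant in $y$; that constant is the $c$ of Equation (2.5).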


2.4 Comparison of Partial and Mixed Membership 

In this section, we compare and contrast the partial membership model with the mixed member- 
ship model using simulated data. Figure 2.3 provides a graphical comparison of the models’ data 
generative processes. Although the generative structures are otherwise very similar, we see that
whereas the mixed membership model assumes local independence (i.e., the outcomes are condi- 
tionally independent given the pure type memberships), the partial membership model makes no 
such assumption. What we can not see in Figure 2.3 is how the pure type parameters and member- 
ship scores are combined together mathematically to define the individual-level distributions of the 
outcomes. To explore this, we examine scatterplots for data generated by the two types of models 
when the data are continuous and probabilities of success generated by the respective models when 
the data are binary. Understanding the differences in the continuous data case will help us select an 
appropriate model for the NBA player data introduced in Section 2.2. 




FIGURE 2.3 

Graphical representations of the mixed membership (left) and partial membership models (right). 


2.4.1 Continuous Data 

For both the partial and mixed membership models, we begin by assuming that the pure type den- 
sities are normal. We allow the means of the normal distributions to vary by pure type. Similar to 
model-based clustering with the mixtures of normals (Fraley et al., 2012), different specifications 
are possible for the variances. For this work, we will focus on two cases. In the first case, we con- 
sider variance specifications to be the same across the pure types. In the second case, we allow the 
variances to differ across pure types. While the mixed membership model uses a local independence 
assumption, the partial membership model does not. Hence, for the partial membership model, we 
additionally consider two cases of variance specification: the case where the outcomes are corre- 
lated, conditional on the membership scores, and the case where the outcomes are uncorrelated, 
conditional on the membership scores. Next, we specify mixed membership and partial member- 
ship models for continuous data with normally distributed pure types before examining scatterplots 
of simulated data under the different scenarios of variance specification. 

Mixed Membership 

Under the mixed membership model and a local independence assumption, each outcome $y_{ij}$, conditional
on the pure type memberships for individual $i$, is distributed

$$y_{ij} \mid g_i, \Theta \sim \sum_k g_{ik}\, N(\mu_{jk}, \sigma^2_{jk}). \tag{2.9}$$

If we restrict $\sigma^2_{jk} = \sigma^2_j$ so that the variances do not differ by pure type, then the model formulation
remains the same. As we will see, in the case of the partial membership model, the model
formulation can be simplified under the same restriction.


Partial Membership 


In the case of the partial membership model, we may assume multivariate normal densities as we
are not restricted to the conditional independence assumption. The observed data for individual $i$,
$y_i$, will also be multivariate normally distributed conditional on the pure type memberships for individual
$i$ and the pure type parameters (recall Equation 2.7). The natural parameters of a multivariate
normal distribution are $\Sigma^{-1}\mu$ and $-\frac{1}{2}\Sigma^{-1}$, where $\mu$ and $\Sigma$ are the mean and covariance matrix
of a multivariate normal distribution. Let $\Theta = \{\mu_k, \Sigma_k : k = 1, \ldots, K\}$ denote the collection of
pure type means and covariance matrices. As a result, the natural parameters of $p(y_i \mid g_i, \Theta)$ are
$\sum_k g_{ik} \Sigma_k^{-1} \mu_k$ and $-\frac{1}{2} \sum_k g_{ik} \Sigma_k^{-1}$.

Using the standard parameterization for the multivariate normal distribution, the vector of observed
data, $y_i$, is conditionally distributed

$$y_i \mid g_i, \Theta \sim N\left( \Big(\sum_k g_{ik} \Sigma_k^{-1}\Big)^{-1} \sum_k g_{ik} \Sigma_k^{-1} \mu_k,\; \Big(\sum_k g_{ik} \Sigma_k^{-1}\Big)^{-1} \right). \tag{2.10}$$

If we restrict $\Sigma_1 = \cdots = \Sigma_K = \Sigma$, then

$$y_i \mid g_i, \Theta \sim N\Big( \sum_k g_{ik} \mu_k,\; \Sigma \Big). \tag{2.11}$$

Finally, if we assume the outcomes $y_{ij}$ are conditionally independent given the pure type memberships
(local independence), each outcome conditional on the pure type memberships for
individual $i$ is distributed

$$y_{ij} \mid g_i, \Theta \sim N\left( \Big(\sum_k \frac{g_{ik}}{\sigma^2_{jk}}\Big)^{-1} \sum_k \frac{g_{ik}\, \mu_{jk}}{\sigma^2_{jk}},\; \Big(\sum_k \frac{g_{ik}}{\sigma^2_{jk}}\Big)^{-1} \right), \tag{2.12}$$

where $\sigma^2_{jk}$ is the $j$-th diagonal element of $\Sigma_k$, now a diagonal matrix.
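
The precision-weighted form of Equations (2.10) through (2.12) can be sketched for the bivariate case as follows. This is an illustrative sketch only: the helper names are hypothetical, and the hand-written 2 x 2 inverse stands in for a general linear algebra routine.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec2(m, v):
    """Product of a 2 x 2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def partial_membership_mvn(g, means, covs):
    """Mean vector and covariance matrix of y_i given g_i, as in Equation (2.10).

    The precision matrix is the g-weighted sum of pure type precisions, and
    the mean is the resulting covariance times the g-weighted sum of
    precision-times-mean terms.
    """
    precision = [[0.0, 0.0], [0.0, 0.0]]
    weighted = [0.0, 0.0]
    for g_k, mu_k, cov_k in zip(g, means, covs):
        prec_k = inv2(cov_k)
        prec_mu = matvec2(prec_k, mu_k)
        for r in range(2):
            weighted[r] += g_k * prec_mu[r]
            for c in range(2):
                precision[r][c] += g_k * prec_k[r][c]
    cov = inv2(precision)
    return matvec2(cov, weighted), cov
```

With a common covariance matrix the function returns the membership-weighted average of the pure type means and the common covariance itself, as in Equation (2.11). With diagonal covariance matrices each coordinate reproduces Equation (2.12).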


Simulated Data Scenarios 

We now compare data generated by each of the two models. Consider three pure types with two 
normally distributed outcomes. We present the means for each pure type in Table 2.2.

For the variance specifications, we explore two scenarios, one where the variances for each out- 
come are the same across pure types and a second where the variances differ across pure types. For 
each scenario, we consider three models: a mixed membership with a local independence assump- 
tion, a partial membership with a local independence assumption, and a partial membership model 
with no restrictions on dependence. 

Table 2.3 summarizes Scenario 1 for which we assume the variance for the first outcome is 4 for 
all pure types and 9 for the second outcome for all pure types. Because of the local independence 
assumption used in the mixed membership model, there is no correlation between the two outcomes. 
As a means of comparison, we consider a corresponding partial membership model that employs 
the local independence assumption and hence also has the correlation between the two outcomes 
restricted to 0. Finally, the partial membership model without a local independence assumption 
assumes a correlation of 0.4. 

Table 2.4 presents the corresponding information for Scenario 2 where the covariances may vary 
by pure type. For the partial membership model with full covariance matrix, the correlations by pure
type were set to 0.4, -0.4, and 0.7.
TABLE 2.2

Pure Type Means.

            Pure Type
Outcome     1     2     3
1          10    25    40
2          25    40    10


TABLE 2.3

Covariance matrices under Scenario 1.

Model                                Pure Types 1-3
Mixed Membership                     $\begin{pmatrix} 4 & 0 \\ 0 & 9 \end{pmatrix}$
Partial Membership (Uncorrelated)    $\begin{pmatrix} 4 & 0 \\ 0 & 9 \end{pmatrix}$
Partial Membership (Correlated)      $\begin{pmatrix} 4 & 2.4 \\ 2.4 & 9 \end{pmatrix}$


TABLE 2.4

Covariance matrices under Scenario 2.

Under Scenario 1, we assume the pure type covariances are common across all three pure types. We
keep the population parameters constant and vary the distribution of membership scores to produce 
scatterplots of observed data. 

We generated 1000 random membership vectors from a Dirichlet($\alpha\rho$) distribution with $\alpha = 1$
and $\rho = (1/3, 1/3, 1/3)$. Using these membership scores, we simulated 1000 bivariate outcomes.
The results are depicted in Figure 2.4(b). The left plot shows the mixed membership model and the 
center plot displays the corresponding partial membership model with a diagonal covariance matrix 
(i.e., local independence was assumed as in the case of the mixed membership model). The right 
plot shows partial membership model results with a full covariance matrix where the variances of 
the outcomes are the same as the previous two cases but the correlation between the outcomes is set 
to 0.4. 
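
A minimal sketch of this kind of simulation is given below for the two local independence models under Scenario 1. It is not the code used to produce the figures: the function names, the seed, and the use of normalized gamma draws to obtain Dirichlet vectors are assumptions of the sketch, and the correlated partial membership case is omitted for brevity.

```python
import math
import random

PURE_TYPE_MEANS = [(10.0, 25.0), (25.0, 40.0), (40.0, 10.0)]  # Table 2.2
OUTCOME_SDS = (2.0, 3.0)  # Scenario 1: variances 4 and 9 for every pure type


def sample_dirichlet(alpha, rho, rng):
    """Draw from Dirichlet(alpha * rho) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha * r, 1.0) for r in rho]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_mixed(g, rng):
    """Mixed membership with local independence: each outcome picks its own
    latent pure type with probabilities g, then is drawn from that type."""
    y = []
    for j, sd in enumerate(OUTCOME_SDS):
        k = rng.choices(range(len(g)), weights=g)[0]
        y.append(rng.gauss(PURE_TYPE_MEANS[k][j], sd))
    return tuple(y)


def simulate_partial_diagonal(g, rng):
    """Partial membership with a common diagonal covariance (Equation 2.11):
    the mean is the g-weighted average of the pure type means."""
    return tuple(
        rng.gauss(sum(g_k * mu[j] for g_k, mu in zip(g, PURE_TYPE_MEANS)), sd)
        for j, sd in enumerate(OUTCOME_SDS)
    )


def simulate(n=1000, alpha=1.0, rho=(1 / 3, 1 / 3, 1 / 3), seed=0):
    """Return membership vectors and the outcomes simulated under each model."""
    rng = random.Random(seed)
    memberships = [sample_dirichlet(alpha, rho, rng) for _ in range(n)]
    mixed = [simulate_mixed(g, rng) for g in memberships]
    partial = [simulate_partial_diagonal(g, rng) for g in memberships]
    return memberships, mixed, partial
```

Because each mixed membership outcome is drawn from exactly one pure type, its values stay close to one of the pure type means, whereas the partial membership outcomes are centered at blended means that can fall anywhere between them.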

In Figure 2.4(b), the mixed membership model generates points in three columns. Looking more
closely, each column can be divided horizontally into three parts corresponding to the means for
each pure type for $y_{i1}$. Dividing the columns in this manner produces $K^2 = 9$ clusters of points,
consistent with the latent class representation described by Erosheva (2006) and the more extreme
depiction presented in Figure 4 in Heller et al. (2008). The partial membership model, in both the
diagonal and full covariance matrix cases, generates points in a more cloud-like structure. One can 
see that the partial membership model with the full covariance matrix generates a set of points that 
is “rotated,” albeit slightly, as compared to the set generated by the partial membership model with 
a diagonal covariance matrix. 

By varying the values of $\alpha$, we can further compare the models. If we set $\alpha = 10$, the membership
scores will fluctuate more closely around 1/3 than with $\alpha = 1$. Figure 2.4(c) presents 1000 generated
data points with membership scores generated from a Dirichlet($\alpha\rho$) distribution with $\alpha = 10$ and
$\rho = (1/3, 1/3, 1/3)$. In the case of the mixed membership model, the $K^2$ clusters become slightly
more apparent while the data generated by the partial membership models reduce to single clusters
with less variation. If we set $\alpha = 1/10$, the membership scores tend to be closer to the extremes
0 or 1. Figure 2.4(a) presents the simulated data from each model with this set of membership
scores. The three plots now appear largely similar. The primary differences are that the set of points 
generated by the partial membership model with full covariance matrix is “rotated” as compared to 
the other two and that the mixed membership model appears to show greater variation in points on 
the periphery. 
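
The role of $\alpha$ can be made explicit with the standard moments of the Dirichlet distribution. The short sketch below (the function name is hypothetical) uses the known results that, under Dirichlet($\alpha\rho$) with $\rho$ summing to 1, each membership score has mean $\rho_k$ and variance $\rho_k (1 - \rho_k) / (\alpha + 1)$.

```python
def dirichlet_membership_moments(alpha, rho):
    """Mean and variance of each membership score under Dirichlet(alpha * rho),
    assuming the entries of rho sum to 1."""
    return [(r, r * (1.0 - r) / (alpha + 1.0)) for r in rho]


for alpha in (0.1, 1.0, 10.0):
    mean, var = dirichlet_membership_moments(alpha, (1 / 3, 1 / 3, 1 / 3))[0]
    print(f"alpha={alpha}: mean={mean:.3f}, sd={var ** 0.5:.3f}")
```

The mean stays at 1/3 for every $\alpha$, while the standard deviation shrinks as $\alpha$ grows, which is why larger $\alpha$ pulls the membership scores toward 1/3 and smaller $\alpha$ pushes them toward 0 or 1.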

Scenario 2: Different Variances Across Pure Types 

We subsequently generate data points from each individual-level mixture model according to Sce- 
nario 2. Again, the pure type covariances for Scenario 2 are listed in Table 2.4. Figure 2.5(b) presents 
the data generated by the mixed membership model, the partial membership model with diagonal 
covariance, and the partial membership model with unrestricted covariance for $\alpha = 1$. The sets of
points generated by the mixed membership and partial membership model with diagonal covariance
appear rectangular in shape. The set of points from the partial membership model with diagonal
covariance is more densely populated in the center while one can faintly make out the clusters in the
set of points generated by the mixed membership model. The partial membership model with full
covariance matrices on the other hand is more triangular in structure. 

Figures 2.5(c) and 2.5(a) provide the corresponding plots for membership vectors generated by
$\alpha = 10$ and $\alpha = 1/10$, respectively. With $\alpha = 10$, we again see the greater concentration of points
into a single cluster for the partial membership models while the different clusters become a little
more apparent for the mixed membership model. In the case of $\alpha = 1/10$, the mixed membership
and partial membership with diagonal covariance models again appear very similar. The full covari- 
ance partial membership model, however, displays a triangular boundary with an empty center. 

FIGURE 2.4

Simulated data according to different individual-level mixture models assuming variances are the
same across pure types. Panels: (a) $\alpha = 0.1$, (b) $\alpha = 1$, (c) $\alpha = 10$. Each panel contains: mixed
membership (left), partial membership with local independence assumption (center), partial membership
with full covariance matrix (right). The solid points represent the pure type centers and the dashed
ellipses represent 2SD contours.

FIGURE 2.5

Simulated data according to different individual-level mixture models for the case where the variances
are different across pure types. Panels: (a) $\alpha = 0.1$, (b) $\alpha = 1$, (c) $\alpha = 10$. Each panel contains:
mixed membership (left), partial membership with local independence assumption (center), partial
membership with full covariance matrix (right). The solid points represent the pure type centers and
the dashed ellipses represent 2SD contours.

Overall, while the mixed and partial membership models can produce scatterplots that look
very similar for some special cases of the distribution of the membership scores, we observe that
the partial membership models generate scatterplots that are more contiguous. At the same time, 
we emphasize that placements of pure type means and variances, as well as the selection of the 
distribution of the membership scores, can create different patterns that would not be as easy to 
recognize as either mixed or partial membership. To investigate this further, one could consider a 
template for variance specification as provided by model-based clustering with Gaussian clusters 
(Fraley et al., 2012). 


2.4.2 Binary Data 

We now examine the mixed and partial membership models for binary data. We follow a geometric 
approach (Erosheva, 2005) where we keep the population parameters constant and examine popu- 
lation heterogeneity manifolds obtained by letting subject-level parameters vary over their natural 
range. 

In the case of binary data, we compare the models by examining the probability of a positive response,
$p(y_{ij} = 1 \mid g_i, \Theta)$, for outcome $j$ and individual $i$, conditional on the pure type membership
of individual $i$. Let $\theta_{jk}$ denote the probability of a positive response for pure type $k$ and outcome $j$.
Then, under the mixed membership model,

$$\theta_{ij} = p(y_{ij} = 1 \mid g_i, \Theta) = \sum_k g_{ik} \theta_{jk}, \tag{2.13}$$

so that $y_{ij} \mid g_i, \Theta$ has a Bernoulli distribution where the probability of a positive response is a
weighted arithmetic mean of the pure type response probabilities.

In the case of the partial membership model, $y_{ij} \mid g_i, \Theta$ also has a Bernoulli distribution
but where the natural parameter is a convex combination of the pure type natural parameters,
$\sum_k g_{ik} \log[\theta_{jk} / (1 - \theta_{jk})]$. As a result,

$$\theta_{ij} = p(y_{ij} = 1 \mid g_i, \Theta) = \frac{\prod_k \theta_{jk}^{g_{ik}}}{\prod_k \theta_{jk}^{g_{ik}} + \prod_k (1 - \theta_{jk})^{g_{ik}}}. \tag{2.14}$$

In the case of the partial membership model, the probability of a positive response (Equation 2.14)
is a normalized weighted geometric mean of the pure type response probabilities.

We now examine how these differences in the mixed membership and partial membership mod- 
els for binary data manifest themselves for different pure type membership and parameter values. 
We consider $K = 2$ pure types and $p = 2$ outcomes. Let $g_i$ denote the degree of membership for
an arbitrary individual in the first pure type; the degree of membership in the second pure type is
then $1 - g_i$. We examine $\theta_{ij}$, the marginal probability of a positive response for outcome $j$ and
individual $i$, given by Equations (2.13) and (2.14) for the two types of models, respectively.


TABLE 2.5

Pure type response probabilities.

Scenario   $\theta_{11}$   $\theta_{12}$   $\theta_{21}$   $\theta_{22}$
1          0.1             0.8             0.3             0.6
2          0.05            0.8             0.3             0.95
3          0.01            0.8             0.3             0.99
4          0.001           0.8             0.3             0.999
5          0               0.8             0.3             1


FIGURE 2.6

Marginal probability plots for Scenarios 1-5 in Table 2.5 obtained with the partial membership
model (darker) and the mixed membership model (lighter).

Table 2.5 presents five sets of the pure type response probabilities, $\theta_{jk}$; the corresponding
marginal probability plots appear in Figure 2.6. Treating the pure type response probabilities as
constant, we examine population heterogeneity manifolds obtained by letting membership scores $g_i$
vary over their natural range from 0 to 1. The darker points indicate the population heterogeneity
manifolds obtained with the partial membership model for given $\theta_{11}$ and $\theta_{22}$, whereas the lighter
points indicate the corresponding manifolds for the mixed membership model.

For Scenario 1, we see that the heterogeneity manifold for the partial membership model is a
nonlinear path that closely resembles the heterogeneity manifold for the mixed membership model.
As the pure type response probability $\theta_{11}$ decreases and $\theta_{22}$ increases over the five scenarios, the
paths of points increasingly diverge. Finally, for Scenario 5, the partial membership model produces
a heterogeneity manifold that takes only three pairs of values, sitting at the corners of the marginal
probability space. At $g_i = 0$ and $g_i = 1$, the partial membership model produces $\theta_{ij}$ values equivalent
to the mixed membership model. For values of $g_i$ with $0 < g_i < 1$ in Scenario 5, $\theta_{i1} = 0$ and
$\theta_{i2} = 1$ under the partial membership model. Consistent with the geometric mean representation in
Equation (2.14), in scenarios where one of the pure type conditional response probabilities equals 1,
any partial membership in that pure type implies that the individual’s probability for that outcome
must be 1. Similarly, when one of the pure type conditional response probabilities equals 0, any
partial membership in that pure type implies that the probability for that outcome must be 0. We
do not observe this property in the mixed membership model that employs the arithmetic mean to
derive individual-specific marginal probabilities (Equation 2.13). Moreover, as one of the pure type
probabilities decreases to 0 or increases to 1, the population heterogeneity manifolds obtained under
the partial membership and mixed membership models increasingly diverge, as shown in Figure 2.6.
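
The contrast between Equations (2.13) and (2.14) can be reproduced with a few lines of code. The sketch below uses hypothetical function names and takes, for one outcome $j$, the list of pure type response probabilities $(\theta_{j1}, \ldots, \theta_{jK})$ together with the membership scores.

```python
import math


def mixed_prob(g, thetas):
    """Equation (2.13): weighted arithmetic mean of the response probabilities."""
    return sum(g_k * t for g_k, t in zip(g, thetas))


def partial_prob(g, thetas):
    """Equation (2.14): normalized weighted geometric mean.

    Undefined (division by zero) if, for the same outcome, one pure type with
    positive membership has probability 0 and another has probability 1.
    """
    num = math.prod(t ** g_k for g_k, t in zip(g, thetas))
    den = num + math.prod((1.0 - t) ** g_k for g_k, t in zip(g, thetas))
    return num / den


# Outcome 1 under Scenarios 1 and 5 of Table 2.5, with equal membership
# in the two pure types (g_i = 0.5).
g = [0.5, 0.5]
for label, thetas in (("Scenario 1", [0.1, 0.8]), ("Scenario 5", [0.0, 0.8])):
    print(label, mixed_prob(g, thetas), partial_prob(g, thetas))
```

In Scenario 5 the partial membership probability for the first outcome is exactly 0 for any membership score strictly between 0 and 1, while the mixed membership probability moves linearly between 0 and 0.8.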

Overall, we have demonstrated that the partial and mixed membership models exhibit different 
data-generating behavior. In the case of continuous data, the partial membership model generates 
data in more contiguous patterns that may be more natural for some applications. However, except 
for some special cases, it may not be possible to tell the nature of individual-level mixing from 
scatterplots. Hence, data mechanisms need to be considered. 

Our decision for the analysis of the NBA data is to use a partial membership model. We be- 
lieve that a partial membership model could better describe the types of data patterns displayed in 
Figures 2.1 and 2.2 than a mixed membership model. However, an equally important factor in our 
decision is the nature of individual-level mixing in the data. The NBA player data contain variables 
that themselves are summary statistics as opposed to individual player’s actions. While mixed mem- 
bership modeling should be more appropriate for the latter type of data that could exhibit changes 
in (latent) pure type assignments for each variable, we find the partial membership representation to
be more consistent with the averages reported over an NBA season. These considerations are akin
to the switching and blending interpretations discussed in Galyardt (2014). 


2.5 A Correlated Partial Membership Model for Continuous Data 

Before analyzing the NBA player style data with a partial membership model for continuous data, 
we develop an extension of the partial membership model that allows for correlated membership 
scores. We subsequently discuss estimation of the correlated partial membership model. 

2.5.1 Correlated Memberships 

One limitation of the partial membership model as originally formulated is its inability to flexibly 
accommodate correlations among an individual’s membership in the pure types. The Dirichlet prior 
induces a small negative correlation among the pure type memberships in individuals. Blei and 
Lafferty (2007) addressed this shortcoming in mixed membership topic models by replacing the 
Dirichlet prior for individual membership scores with a logistic normal prior. Under this model, 
draws from the multivariate normal are mapped onto the probability simplex so that the
values are positive and constrained to add to 1,

$$\eta_{g_i} \sim N(\mu, \Sigma),$$

$$g_{ik} = \frac{\exp(\eta_{g_{ik}})}{\sum_{l} \exp(\eta_{g_{il}})}.$$

Because of the constraint that $\sum_k g_{ik} = 1$, we fix the $K$th element of $\eta_{g_i}$ to 0 so that the vector
contains only $K - 1$ free elements and $\mu$ and $\Sigma$ have dimensions $K - 1$ and $(K - 1) \times (K - 1)$,
respectively. Aitchison and Shen (1980) discuss properties and uses of the logistic normal, including
a comparison with the Dirichlet distribution. They suggest that the logistic normal can suitably 
approximate the Dirichlet distribution so that little, if anything, would be lost if we applied the 
logistic normal in cases where a Dirichlet prior would be appropriate. 
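
To make the transformation concrete, the sketch below draws membership vectors from a logistic normal distribution with the last untransformed score fixed at 0. It assumes NumPy is available; the function name `sample_logistic_normal` and the values of `mu` and `sigma` are hypothetical and chosen only to show that positively correlated memberships are attainable.

```python
import numpy as np


def sample_logistic_normal(mu, sigma, n, rng):
    """Draw n membership vectors of length K from a logistic normal.

    mu has length K - 1 and sigma is (K - 1) x (K - 1); the K-th
    untransformed score is fixed at 0.
    """
    eta_free = rng.multivariate_normal(mu, sigma, size=n)   # n x (K - 1)
    eta = np.hstack([eta_free, np.zeros((n, 1))])           # append fixed 0
    eta -= eta.max(axis=1, keepdims=True)                   # numerical stability
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
mu = np.zeros(2)                       # K = 3 pure types
sigma = np.array([[4.0, 3.9],
                  [3.9, 4.0]])         # strong positive dependence
g = sample_logistic_normal(mu, sigma, n=5000, rng=rng)
corr = np.corrcoef(g, rowvar=False)    # 3 x 3 correlation of memberships
```

In this configuration the first two membership scores tend to rise and fall together, a pattern that a Dirichlet prior cannot produce.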

2.5.2 A Correlated Partial Membership Model 

To model the continuous data in the NBA example, we assume the observed data points for individual
$i$, $y_i$, are conditionally independent given the pure type memberships for the individual, $g_i$.
Equation (2.9) gives the distribution of $y_{ij}$ under this assumption. Now let $\tau_{jk} = \sigma_{jk}^{-2}$ and let
$\alpha_{jk} = \sigma_{jk}^{-2} \mu_{jk}$ in Equation (2.9) so that $\tau_{jk}$ and $\alpha_{jk}$ correspond closely to the natural parameters
of a normal distribution. Moreover, let $\Theta = \{g_{ik}, \alpha_{jk}, \tau_{jk}, \mu_k;\ j = 1, \ldots, J,\ k = 1, \ldots, K\}$.

For $\alpha_{jk}$ and $\tau_{jk}$, we specify normal and gamma prior distributions, respectively. The elements of
the mean vector of the untransformed pure type memberships, $\mu_k$, are also specified to have normal
prior distributions. For the covariance matrix for the untransformed pure type memberships, $\Sigma$, we
use an inverse Wishart prior distribution. Fully stated, the correlated partial membership model for
continuous data is


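$$
\begin{aligned}
y_{ij} \mid g_i, \Theta &\sim N\left( \frac{\sum_k g_{ik} \alpha_{jk}}{\sum_k g_{ik} \tau_{jk}},\ \frac{1}{\sum_k g_{ik} \tau_{jk}} \right), \\
\alpha_{jk} &\sim N\left(m_{\alpha_{jk}}, s^2_{\alpha_{jk}}\right), \\
\tau_{jk} &\sim \text{Gamma}\left(\alpha_{\tau_{jk}}, \phi_{\tau_{jk}}\right), \\
g_{ik} &= \frac{\exp(\eta_{g_{ik}})}{\sum_{l} \exp(\eta_{g_{il}})}, \\
\eta_{g_i} &\sim N(\mu, \Sigma), \\
\mu_k &\sim N\left(m_{\mu_k}, s^2_{\mu_k}\right), \\
\Sigma &\sim \text{Inv. Wishart}\left(\nu_\Sigma, S_\Sigma\right).
\end{aligned}
\qquad (2.15)\text{-}(2.23)
$$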
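
As an illustration of how the membership scores blend the pure types, the sketch below computes the mean and variance of one individual's normal response distribution. It assumes the usual product-of-densities form of the partial membership model, in which the precisions $\tau_{jk}$ and the precision-weighted means $\alpha_{jk}$ combine linearly in the membership scores; the function name and numerical values are hypothetical.

```python
def blended_normal(g, alpha, tau):
    """Mean and variance of y_ij given memberships g (assumed form).

    alpha[k] = mu_k / sigma_k^2 and tau[k] = 1 / sigma_k^2 for one variable j.
    """
    precision = sum(gk * tk for gk, tk in zip(g, tau))
    mean = sum(gk * ak for gk, ak in zip(g, alpha)) / precision
    return mean, 1.0 / precision


# Two pure types with means 10 and 30 and common variance 4.
tau = [0.25, 0.25]
alpha = [2.5, 7.5]

mean_half, var_half = blended_normal([0.5, 0.5], alpha, tau)
mean_pure, var_pure = blended_normal([1.0, 0.0], alpha, tau)
```

An individual split evenly between the two pure types is centered halfway between them, rather than alternating between the two pure type distributions.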


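In order to obtain posterior samples of $\mu_{jk}$ and $\sigma^2_{jk}$ rather than $\alpha_{jk}$ and $\tau_{jk}$, we may transform the
posterior samples of $\alpha_{jk}$ and $\tau_{jk}$ to $\mu_{jk} = \alpha_{jk} / \tau_{jk}$ and $\sigma^2_{jk} = 1 / \tau_{jk}$.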
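
The back-transformation follows directly from the definitions $\tau_{jk} = \sigma_{jk}^{-2}$ and $\alpha_{jk} = \sigma_{jk}^{-2} \mu_{jk}$. A small sketch, with made-up draws standing in for posterior samples:

```python
def to_mean_variance(alpha, tau):
    """Map (alpha, tau) = (mu / sigma^2, 1 / sigma^2) back to (mu, sigma^2)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return alpha / tau, 1.0 / tau


# Hypothetical draws of (alpha_jk, tau_jk) for a single variable and pure type.
draws = [(8.0, 0.25), (10.5, 0.5), (9.0, 0.125)]
mu_sigma2 = [to_mean_variance(a, t) for a, t in draws]
```

Applying the map draw by draw gives posterior samples on the mean and variance scale without refitting the model.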



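The joint probability of the data and the parameters in $\Theta$, conditional on $\Sigma$, is

$$
\begin{aligned}
p(y, \Theta \mid \Sigma) = {} & \prod_i \prod_j p(y_{ij} \mid g_i, \Theta) \\
& \times \prod_j \prod_k \left(2\pi s^2_{\alpha_{jk}}\right)^{-1/2} \exp\left(-\frac{(\alpha_{jk} - m_{\alpha_{jk}})^2}{2 s^2_{\alpha_{jk}}}\right) \\
& \times \prod_j \prod_k \frac{\phi_{\tau_{jk}}^{\alpha_{\tau_{jk}}}}{\Gamma(\alpha_{\tau_{jk}})}\, \tau_{jk}^{\alpha_{\tau_{jk}} - 1} \exp\left(-\phi_{\tau_{jk}} \tau_{jk}\right) \\
& \times \prod_i (2\pi)^{-(K-1)/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} (\eta_{g_i} - \mu)^T \Sigma^{-1} (\eta_{g_i} - \mu)\right) \\
& \times \prod_k \left(2\pi s^2_{\mu_k}\right)^{-1/2} \exp\left(-\frac{(\mu_k - m_{\mu_k})^2}{2 s^2_{\mu_k}}\right). \qquad (2.24)
\end{aligned}
$$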


As Heller et al. (2008) noted, all of the parameters in $\Theta$ are continuous and, moreover, we may take
the derivatives of the log of the above probability expression. As a result, the problem of Bayesian
estimation for this model lends itself to Hybrid (Hamiltonian) Monte Carlo. Hybrid Monte Carlo
uses the derivative of the log joint probability to inform its proposals. As a result, in high dimensions,
this algorithm may outperform more traditional algorithms such as Metropolis-Hastings or Gibbs
sampling. For a thorough introduction to Hybrid Monte Carlo, see Neal (2010). In order to avoid the
imposition of non-negativity restrictions on $\tau_{jk}$ in the Hybrid Monte Carlo algorithm, we employ
the transformation $\eta_{\tau_{jk}} = \log(\tau_{jk})$ so that the parameter may take values unrestricted over the real
line.
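
The role of the log transformation can be seen in a one-dimensional toy problem. The sketch below is not the sampler used for the NBA analysis; it runs Hybrid Monte Carlo with a leapfrog integrator on $\eta = \log(\tau)$ for a single parameter with an assumed Gamma(shape `a`, rate `b`) target. The values of `a`, `b`, `step`, and `n_leapfrog` are hypothetical.

```python
import math
import random


def log_density(eta, a, b):
    """Log density of eta = log(tau) when tau ~ Gamma(shape=a, rate=b).

    The term a * eta combines (a - 1) * log(tau) with the log-Jacobian eta.
    """
    return a * eta - b * math.exp(eta)


def grad_log_density(eta, a, b):
    return a - b * math.exp(eta)


def hmc_step(eta, a, b, step, n_leapfrog, rng):
    """One Hybrid Monte Carlo update of eta using a leapfrog integrator."""
    p0 = rng.gauss(0.0, 1.0)
    q, p = eta, p0
    p += 0.5 * step * grad_log_density(q, a, b)       # half step for momentum
    for i in range(n_leapfrog):
        q += step * p                                 # full step for position
        if i < n_leapfrog - 1:
            p += step * grad_log_density(q, a, b)     # full step for momentum
    p += 0.5 * step * grad_log_density(q, a, b)       # final half step
    log_accept = (log_density(q, a, b) - 0.5 * p * p) - (
        log_density(eta, a, b) - 0.5 * p0 * p0)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return q                                      # accept the proposal
    return eta                                        # reject: keep current value


rng = random.Random(1)
a, b = 4.0, 2.0
eta = 0.0
tau_draws = []
for it in range(6000):
    eta = hmc_step(eta, a, b, step=0.1, n_leapfrog=8, rng=rng)
    if it >= 1000:                                    # discard warm-up draws
        tau_draws.append(math.exp(eta))
mean_tau = sum(tau_draws) / len(tau_draws)
```

Because the sampler moves on the unconstrained scale, every back-transformed draw of $\tau$ is positive without any explicit boundary handling.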

We do not rely on Hybrid Monte Carlo to draw $\Sigma$ but rather draw $\Sigma$ in a separate Gibbs step
for the correlated partial membership model. Thus, to sample $(\Theta, \Sigma)$, we apply a Gibbs sampling
algorithm where the first step samples $\Theta$ via Hybrid Monte Carlo and the second step draws $\Sigma$ from its full
conditional distribution,

$$\Sigma \sim \text{Inv. Wishart}\left(\nu_\Sigma + n,\ S_\Sigma + (H_g - 1_n \mu^T)^T (H_g - 1_n \mu^T)\right), \qquad (2.25)$$

where $H_g$ is an $n \times (K - 1)$ matrix of the untransformed membership scores.
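
A sketch of this Gibbs step is given below, assuming NumPy and an integer prior degrees of freedom so that the inverse Wishart draw can be built from normal draws. The function name `draw_sigma`, the prior settings, and the simulated scores are hypothetical.

```python
import numpy as np


def draw_sigma(eta, mu, nu0, s0, rng):
    """Draw Sigma from Inv-Wishart(nu0 + n, s0 + (eta - mu)'(eta - mu)).

    eta is the n x (K - 1) matrix of untransformed membership scores.
    Assumes nu0 is an integer with nu0 + n >= K - 1.
    """
    n, d = eta.shape
    resid = eta - mu                                  # subtract mu from each row
    df = nu0 + n
    scale = s0 + resid.T @ resid
    # If W ~ Wishart(df, scale^{-1}), then W^{-1} ~ Inv-Wishart(df, scale).
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    x = rng.standard_normal((df, d)) @ chol.T         # rows ~ N(0, scale^{-1})
    return np.linalg.inv(x.T @ x)


rng = np.random.default_rng(0)
eta = rng.normal(size=(200, 4))                       # n = 200 individuals, K = 5
sigma = draw_sigma(eta, mu=np.zeros(4), nu0=6, s0=np.eye(4), rng=rng)
```

In a full sampler this draw would alternate with the Hybrid Monte Carlo update of the remaining parameters.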





2.6 Application to the NBA Player Data 

We now apply the correlated partial membership model to NBA player data from the 2010-11 season. We
considered models with 4, 5, and 6 pure types. We employed posterior predictive model checks to 
examine the fit of the model-based marginal distributions and rank correlations to the observed data. 
We ultimately settled on a model with 5 pure types as this model had the smallest number of classes 
that still provided sufficient fit to the data. The 5 pure type correlated partial membership model 
resulted in easily interpretable classes from a substantive viewpoint. Also, each pure type had at
least one membership score above 0.20, meaning that at least one player had 1/5 or more of their 
membership in that type. 

We ran the Gibbs sampling algorithm with a Hybrid Monte Carlo step for 80,000 iterations, 
keeping every 20th draw. We discarded the first 1000 of the retained draws as burn-in, leaving us 
with 3000 samples from the posterior distribution. To assess convergence, we examined trace plots
and used the Geweke (Geweke, 1992) and Raftery-Lewis (Raftery and Lewis, 1995) diagnostic tests. 
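
The sample size arithmetic above can be verified directly. The sketch below only reproduces the bookkeeping of thinning and burn-in; the variable names are hypothetical.

```python
n_iter, thin, burn_in = 80_000, 20, 1_000

# 0-based indices of the iterations that are kept (every 20th draw).
retained = list(range(thin - 1, n_iter, thin))

# Drop the first 1000 retained draws as burn-in.
posterior = retained[burn_in:]

n_retained = len(retained)
n_posterior = len(posterior)
```

Keeping every 20th of 80,000 iterations leaves 4000 draws, and discarding 1000 of those leaves the 3000 posterior samples used in the analysis.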

In examining the posterior estimates for the pure type specific means, $\mu_{jk}$, presented in Table 2.6,
we notice that some of the posterior means take negative values even though all of the statistics
recorded are strictly positive. For example, in the case of the % Ast statistic (the percentage of made 
field goals that are assisted), the range of the data is [0, 100], yet only one of the estimated pure type 
means lies inside this range. This observation is not worrisome by itself as it could be that no indi- 
vidual has high membership values in the pure types with negative means. We are more concerned 
with the associated predictive distributions for the observed data that are directly related to model 
fit. Nonetheless, when a pure type is characterized by values outside the range of observed data, the
interpretation of this pure type is more complicated than that of pure types that are, in principle,
achievable in the population.

Figure 2.7 presents a posterior predictive model check that compares the marginal distribution 
of the percent of made field goals assisted (% Ast) statistic against the replicated values for the
statistic. The histogram depicts the observed data while the black points represent the posterior 
predictive mean count of replicated values falling in the corresponding bin. The black segment 
represents the 95% credible interval. We observe in Figure 2.7 that the model fits the marginal 
distribution of the data well; we obtained similar findings for other variables (not shown). 
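
A check of this kind can be sketched as follows, assuming NumPy and a matrix of replicated datasets with one row per posterior draw. The function name `ppc_bin_summary` and the simulated values are hypothetical stand-ins, not the NBA data.

```python
import numpy as np


def ppc_bin_summary(y_obs, y_rep, edges):
    """Compare observed bin counts with counts from replicated datasets.

    y_rep has one row per posterior draw (one replicated dataset per row).
    """
    obs_counts, _ = np.histogram(y_obs, bins=edges)
    rep_counts = np.array([np.histogram(row, bins=edges)[0] for row in y_rep])
    mean_counts = rep_counts.mean(axis=0)                        # plotted points
    lower, upper = np.percentile(rep_counts, [2.5, 97.5], axis=0)  # segments
    return obs_counts, mean_counts, lower, upper


rng = np.random.default_rng(0)
y_obs = rng.normal(60.0, 15.0, size=300)            # stand-in for observed % Ast
y_rep = rng.normal(60.0, 15.0, size=(500, 300))     # stand-in for replicates
edges = np.linspace(0.0, 100.0, 11)                 # 10 bins on [0, 100]
obs_counts, mean_counts, lower, upper = ppc_bin_summary(y_obs, y_rep, edges)
```

Replicated values that fall outside the bin range are simply not counted, which is one place where predictive mass outside the admissible range of the data would go unnoticed in such a plot.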

Although the model provides a good fit to the observed data, the shortcoming of this model 
is that it still places (small) non-zero predictive density in the improbable region of the data. This 
shortcoming will naturally arise when we use a normal distribution to model range-restricted data.

Examining the ordering of the posterior means can provide us with a way to characterize the
pure types in relation to one another. Table 2.6 illustrates that pure type 1 comprises players who
play a high number of minutes (Min), have a high percentage of their shots assisted (% Ast), shoot
mid- and long-range jumpers (Medium, Long, 3s), and have a low steals rate (Stls). We refer to
this pure type as the “high minute shooters.” A high percentage of shots assisted (% Ast) and high
volume of 3-point shots (3s) also describes pure type 2, but members of this pure type have fewer
shots at all other distances (Rim, Close, Medium, Long) and a lower number of minutes played.
We refer to this pure type as the “3-point specialists.” The posterior means for the 3rd pure type are
high relative to those for the other pure types across almost all variables except for the mid- to
long-range jumpers (Medium, Long, 3s). We use the term “active player” for this pure type. Low
minutes played (Min) and high offensive rebound rates (ORR) are the most distinguishing features
of pure type 4, which we refer to as the “limited big men” pure type. High assist (AR) and turnover
(TOR) ratios, high steals per 40 minutes, a low percentage of shots assisted, and low blocked shots
per 40 minutes mark the final pure type. We refer to this pure type as the “ball handlers” pure type.

TABLE 2.6

Posterior means for pure type mean parameters, $\mu_{jk}$.

                             Pure Type
Var.             1         2         3         4         5
Min          42.40     13.36    416.04     18.09     29.43
% Ast       132.09    108.08   -179.33     70.81    -12.47
AR            4.36    131.68    916.56      9.20    607.91
TOR           6.55   -132.22    210.90     16.96    194.49
ORR         -29.30     -5.90   1552.00      9.34      1.70
DRR         228.25      7.51   1077.20     17.26      8.01
Rim           0.26     -0.64    116.07      3.73      6.03
Close        14.73     -0.13    238.88      1.38      2.07
Medium      340.24      0.08     10.87      1.17      3.35
Long         77.21      0.72    -33.99      3.05      3.81
3s           10.80     15.21     22.61      0.02      2.14
Stls         -3.51      0.96     54.93      0.81      1.82
Blks         12.81      0.25     61.76      1.84      0.21

Figure 2.8 presents the mean posterior memberships of the players in these different pure types.
The point symbols denote each player’s assigned position as recorded in the original dataset. Here, we can see
that high membership in some pure types corresponds to certain positional assignments. Thus, the
highest memberships in the limited big men pure type (pure type 4) are obtained by centers (C) and
power forwards (PF) while the ball handlers pure type (pure type 5) is dominated by point guards
(PG). Membership in pure types 1-3 does not have a close correspondence with specific assigned
positions. For pure types 1-3, and to a lesser extent for pure type 5, no players come close to being
fully represented by the pure type. This explains why the model performs well for predicting the
marginal probability for the % Ast outcome even though the posterior means for pure types 1-3 and
5 are out of bounds on that variable.

In contrast to the original partial membership model with Dirichlet membership scores (Heller 
et al., 2008), the correlated partial membership model allows for a more flexible correlation structure 
among components of the membership vector. Table 2.7 presents the posterior mean correlations of 
the pure type memberships that range from -0.664 to 0.410. The limited big man pure type (pure 
type 4) shows low to moderate negative correlations with all other pure types. The active player pure 
type, on the other hand, shows low to moderate positive correlations with the high minute shooter 
and 3-point specialist pure types and small negative correlations with the limited big man and ball 
handler pure types. We note that it is impossible to observe positive correlations under the Dirichlet 
type models. This suggests that our decision to allow for more flexible modeling of the pure type 
membership correlations was appropriate for the data. 
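
The impossibility of positive correlations under a Dirichlet distribution follows from its closed-form correlation, $-\sqrt{\alpha_i \alpha_j / ((\alpha_0 - \alpha_i)(\alpha_0 - \alpha_j))}$ with $\alpha_0 = \sum_k \alpha_k$, which is negative for every choice of parameters. A short sketch with hypothetical parameter values:

```python
import math


def dirichlet_corr(alpha, i, j):
    """Correlation between components i and j of a Dirichlet(alpha) vector."""
    a0 = sum(alpha)
    return -math.sqrt(
        alpha[i] * alpha[j] / ((a0 - alpha[i]) * (a0 - alpha[j])))


# Symmetric Dirichlet with K = 5: every pairwise correlation is -1 / (K - 1).
sym = dirichlet_corr([0.5] * 5, 0, 1)

# An asymmetric example still gives a negative correlation.
asym = dirichlet_corr([2.0, 1.0, 0.5, 0.5, 1.0], 0, 1)
```

By comparison, several entries of Table 2.7, such as the correlation of 0.410 between pure types 1 and 3, are positive.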


TABLE 2.7

Posterior mean correlations of membership scores.

Pure type        1         2         3         4         5
1            1.000     0.081     0.410    -0.528     0.160
2            0.081     1.000     0.223    -0.553    -0.080
3            0.410     0.223     1.000    -0.235    -0.328
4           -0.528    -0.553    -0.235     1.000    -0.664
5            0.160    -0.080    -0.328    -0.664     1.000

FIGURE 2.7

Histogram of the observed values for the % Ast statistic. The black points indicate the mean count
across replicated datasets for each score. The black vertical segment indicates the interval from the
2.5% to 97.5% quantiles across replicated datasets.


FIGURE 2.8

The mean posterior memberships of the players by pure type. The shapes of the points represent the
different positions of each player.

To explore the compositional styles of NBA players further, consider the posterior mean
membership scores and the corresponding credible intervals for three NBA “combo guards”: Mario
Chalmers, Steve Blake, and Rudy Fernandez, as identified by Lutz (2012). As a point of contrast, 
we examine the corresponding quantities for Chris Paul, who is generally considered to be an ex- 
ample of a pure point guard (Figure 2.9). We observe that 80% of Chris Paul’s membership is in 
the ball handlers pure type. For the other three players, their membership is largely split between 
the ball handlers pure type and the 3-point specialists. Thus, we see that the correlated partial mem- 
bership model describes the combo guard players using a mixture of pure types. This result stands 
in contrast to the results of the cluster analysis performed by Lutz (2012), where the combo guards 
comprised their own cluster, entirely separate from the other 12 clusters found in that analysis. Our 
correlated partial membership model uses only 5 pure types but characterizes the heterogeneity in 
individual playing styles as combinations of these pure types. 




FIGURE 2.9 

The mean posterior memberships and 95% posterior credible intervals of Chris Paul, Mario 
Chalmers, Rudy Fernandez, and Steve Blake. The grey points represent the posterior mean mem- 
berships of the other players in the data. 


2.7 Summary and Discussion 

In this chapter, we explored two individual-level mixture models for latent compositional data, 
namely, the mixed and partial memberships models. We found that the partial membership model 
has better potential for producing realistic representations of contiguous data patterns. However, we 
note that high-dimensional multivariate distributions of real data typically present even more com- 
plexity than the simulated examples considered here, which could easily mask the soft clustering 
nature of the underlying process. In such cases, one should consider a plausible interpretation for
the latent compositional data at hand. For example, we point out that the partial membership for- 
mulation is consistent with the blending interpretation of mixed membership models as proposed 
by Galyardt (2014), because the NBA player dataset is primarily composed of continuous summary 
statistics. By contrast, in the binary data case, depending on the placement of pure type response 
probabilities, we observe that the partial membership model may result in a very particular behav- 
ior where fewer outcome combinations are possible compared to the Grade of Membership model. 
The implication of this finding for individual-level mixture models with binary data is that partial 
membership may not be appropriate for all binary data cases. 

We modified the partial membership model to incorporate a logistic normal distribution for pure 
type memberships, similar to the correlated topic model extension (Blei and Lafferty, 2007) of 
the latent Dirichlet allocation models (Blei et al., 2003). This approach gave us more flexibility in 
specifying the dependence structure among the pure type memberships. We have illustrated the use 
of a partial membership model on continuous data using NBA player statistics. The NBA dataset 
provided an illustrative example where pure type membership scores exhibited both negative and 
positive correlations. We note that it is not possible to obtain positive correlations when one employs 
a Dirichlet distribution for the membership scores. 

Although our partial membership analysis of the NBA player data resulted in a good fit as 
measured by the posterior predictive model checks, the limitation of using Gaussian pure type dis- 
tributions is that the predicted values may lie outside of the allowable data intervals for variables 
that are constrained in their range. While it may be possible to specify other distributions for the 
pure types that can produce suitably constrained predicted values, a more general semiparametric 
approach that can accommodate not only range-restricted variables but also mixed data with both 
discrete and continuous outcomes could be more beneficial going forward (Gruhl et al., 2013). Ex- 
amples of mixed outcome data are increasingly common in medicine and the social sciences, and 
the development of individual-level mixture models could be helpful for characterizing patterns in 
multivariate mixed outcomes. 

The original three mixed membership models all analyze categorical data. In this special case there
are two equivalent interpretations of what it means for an observation to have mixed membership. 
Individuals with mixed membership in multiple profiles may be considered to be ‘between’ the pro- 
files, or they can be interpreted as ‘switching’ between the profiles. In other variations of mixed 
membership, the between interpretation is inappropriate. This chapter clarifies the distinction be- 
tween the two interpretations and characterizes the conditions for each interpretation. I present a 
series of examples that illustrate each interpretation and demonstrate the implications for model 
fit. The most counterintuitive result may be that no change in the distribution of the membership 
parameter will allow for a between interpretation. 


3.1 Introduction 

The idea of mixed membership is a simple, intuitive idea. Individuals in a population may belong to 
multiple subpopulations, not just a single class. A news article may address multiple topics rather 
than fitting neatly in a single category (Blei et al., 2003). Patients sometimes get multiple diseases 
at the same time (Woodbury et al., 1978). An individual may have genetic heritage from multiple
subgroups (Pritchard et al., 2000; Shringarpure, 2012). Children may use multiple strategies in 
mathematics problems rather than sticking to a single strategy (Galyardt, 2012). 

The problem of how to turn this intuitive idea into an explicit probability model was originally 
solved by Woodbury et al. (1978) and later independently by Pritchard et al. (2000) and Blei et al. 
(2003). Erosheva (2002) and Erosheva et al. (2004) then built a general mixed membership frame- 
work to incorporate all three of these models. 

Erosheva (2002) and Erosheva et al. (2007) also showed that every mixed membership model 
has an equivalent finite mixture model representation. The proof in Erosheva (2002) shows that the 
relationship holds for categorical data; Erosheva et al. (2007) indicates that the same result holds in 
general. 

The behavior of mixed membership models is best understood in the context of this represen- 
tation theorem. The shape of data distributions, the difference between categorical and continuous 
data, possible interpretations, and identifiability all flow from the finite mixture representation (Gal- 
yardt, 2012). This chapter describes the general mixed membership model and then explores the 
implications of Erosheva’s representation theorem. 


3.2 The Mixed Membership Model 

Due to the history of mixed membership models, and the fact that they were independently devel- 
oped multiple times, there are now two common and equivalent ways to define mixed membership 
models. The generative model popularized by Blei et al. (2003) is more intuitive so we will discuss 
it first, followed by the general model (Erosheva, 2002; Erosheva et al., 2004).

3.2.1 The Generative Process 

The generative version of mixed membership is the more common representation in the machine 
learning community. This is due largely to the popularity of latent Dirichlet allocation (LDA) (Blei 
et al., 2003), which currently has almost 5000 citations according to Google Scholar. LDA has 
inspired a wide variety of mixed membership models, e.g., see Fei-Fei and Perona (2005), Girolami 
and Kaban (2005), and Shan and Banerjee (2011), though these models still fit within the general 
mixed membership model of Erosheva (2002) and Erosheva et al. (2004). 

The foundation of the mixed membership model is the assumption that the population consists 
of $K$ profiles, indexed $k = 1, \ldots, K$, and that each individual $i = 1, \ldots, N$ belongs to the profiles
in different degrees. If the population is a corpus of documents, then the profiles may represent the 
topics in the documents. If we are considering the genetic makeup of a population of birds, then 
the profiles may represent the original populations that have melded into the current population. In 
image analysis, the profiles may represent the different categories of objects or components in the 
images, such as mountain, water, car, etc. When modeling the different strategies that students use 
to solve problems, each profile can represent a different strategy. 

Each individual has a membership vector, $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})$, that indicates the degree to which
they belong to each profile. The term individual here simply refers to a member of the population
and could refer to an image, document, gene, person, etc. The components of $\theta_i$ are non-negative and
sum to 1, so that $\theta_i$ can be treated as a probability vector. For example, if student $i$ used strategies
1 and 2, each about half the time, then this student would have a membership vector of
$\theta_i = (0.5, 0.5, 0, \ldots, 0)$. Similarly, if an image was 40% water and 60% mountain then this would be
indicated by membership scores of 0.4 and 0.6 on the water and mountain profiles and 0 on all others.

Each observed variable $X_j$, $j = 1, \ldots, J$, has a different probability distribution within
each profile. For example, in an image processing application, the water profile has a different
distribution of features than the mountain profile. In another application, such as an assessment
of student learning, different strategies may result in different response times on different problems.
Note that $X_j$ may be univariate or be multidimensional itself, and that we may observe
$r = 1, \ldots, R_{ij}$ replications of $X_j$ for each individual $i$, denoted $X_{ijr}$. The distribution of $X_j$
within profile $k$ is given by the cumulative distribution function (cdf) $F_{kj}$.

We introduce the indicator vector $Z_{ijr}$ to signify which profile individual $i$ followed for replication
$r$ of the $j$th variable. For example, in textual analysis, $Z_{ijr}$ would indicate which topic the $r$th
word in document $i$ came from. In genetics, $Z_{ijr}$ indicates which founding population individual $i$
inherited the $r$th copy of their $j$th allele from.

The membership vector $\theta_i$ indicates how much each individual belongs to each profile so that
$Z_{ijr} \sim \text{Multinomial}(\theta_i)$. We will write $Z_{ijr}$ in the form that, if individual $i$ followed profile $k$ for
replication $r$ of variable $j$, then $Z_{ijr} = k$. The distribution of $X_{ijr}$ given $Z_{ijr}$ is then

$$X_{ijr} \mid Z_{ijr} = k \sim F_{kj}. \qquad (3.1)$$

The full data generating process for individual $i$ is then given by:

1. Draw $\theta_i \sim D(\theta)$.

2. For each variable $j = 1, \ldots, J$:

   (a) For each replication $r = 1, \ldots, R_{ij}$:

       i. Draw a profile $Z_{ijr} \sim \text{Multinomial}(\theta_i)$.

       ii. Draw an observation $X_{ijr} \sim F_{Z_{ijr}, j}(x_j)$ from the distribution of $X_j$ associated with
           the profile $Z_{ijr}$.
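
The steps above can be sketched in a few lines for categorical data. The sketch assumes a Dirichlet distribution for $D(\theta)$ and categorical profile distributions; the function names, the nested-list layout of `profiles`, and all numerical values are hypothetical.

```python
import random


def draw_dirichlet(alpha, rng):
    """Draw a membership vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_individual(alpha, profiles, n_reps, rng):
    """Simulate one individual.

    profiles[k][j] lists the category probabilities of variable j under
    profile k; n_reps[j] is the number of replications of variable j.
    """
    n_profiles = len(profiles)
    theta = draw_dirichlet(alpha, rng)                            # Step 1
    z, x = [], []
    for j, reps in enumerate(n_reps):                             # Step 2
        z_j, x_j = [], []
        for _ in range(reps):                                     # Step 2(a)
            k = rng.choices(range(n_profiles), weights=theta)[0]  # Step 2(a)i
            probs = profiles[k][j]
            value = rng.choices(range(len(probs)), weights=probs)[0]  # 2(a)ii
            z_j.append(k)
            x_j.append(value)
        z.append(z_j)
        x.append(x_j)
    return theta, z, x


rng = random.Random(0)
profiles = [
    [[0.7, 0.2, 0.1], [0.9, 0.1]],   # profile 1: variables 1 and 2
    [[0.1, 0.2, 0.7], [0.2, 0.8]],   # profile 2: variables 1 and 2
]
theta, z, x = generate_individual([0.5, 0.5], profiles, n_reps=[3, 2], rng=rng)
```

Note that the profile indicator is redrawn for every replication of every variable, so an individual with mixed membership can follow different profiles for different observations.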


3.2.2 General Mixed Membership Model 

The general mixed membership model (MMM) makes explicit the assumptions that are tacit within
the generative model. These assumptions are collected into four layers: population
level, subject level, sampling scheme, and latent variable level.

The population level assumptions are that there are $K$ different profiles within the population,
and each has a different probability distribution for the observed variables $F_{kj}$.

The subject level assumptions begin with the individual membership parameter $\theta_i$ that indicates
which profiles individual $i$ belongs to. We then assume that the conditional distribution of $X_{ij}$ given
$\theta_i$ is:

$$
\begin{aligned}
F(x_j \mid \theta_i) &= \sum_{k=1}^{K} \Pr(Z_{ijrk} = 1 \mid \theta_i)\, F(x_j \mid Z_{ijrk} = 1), && (3.2) \\
&= \sum_{k=1}^{K} \theta_{ik} F_{kj}(x_j). && (3.3)
\end{aligned}
$$

Equation (3.3) is the result of combining Steps 2(a)i and 2(a)ii in the generative process. $Z_{ijr}$
is simply a data augmentation vector, and we can easily write the distribution of the observed data
without it. Notice that Step 2 of the generative process assumes that the $X_{ijr}$ are independent given
$\theta_i$. In psychometrics this is known as a local independence assumption. This exchangeability assumption
allows us to write the joint distribution of the response vector $X_i = (X_{i1}, \ldots, X_{iJ})$, conditional
on $\theta_i$ as

$$F(x \mid \theta_i) = \prod_{j=1}^{J} \left[ \sum_{k=1}^{K} \theta_{ik} F_{kj}(x_j) \right]. \qquad (3.4)$$
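
For categorical data, the product-of-mixtures structure can be evaluated directly by replacing each cdf with the corresponding category probability. The sketch below makes that substitution; the function name and the two-profile example are hypothetical.

```python
def conditional_likelihood(x, theta, profiles):
    """Probability of the response vector x given membership vector theta.

    For each variable j, the profile probabilities of the observed category
    are mixed with weights theta_k; the J mixtures are then multiplied.
    profiles[k][j][c] is the probability of category c of variable j
    under profile k.
    """
    lik = 1.0
    for j, xj in enumerate(x):
        lik *= sum(theta[k] * profiles[k][j][xj] for k in range(len(theta)))
    return lik


profiles = [
    [[0.1, 0.9], [0.1, 0.9]],   # profile 1: P(x_j = 0), P(x_j = 1) for j = 1, 2
    [[0.8, 0.2], [0.8, 0.2]],   # profile 2
]
x = [1, 1]
half = conditional_likelihood(x, [0.5, 0.5], profiles)   # (0.45 + 0.10) ** 2
pure = conditional_likelihood(x, [1.0, 0.0], profiles)   # 0.9 * 0.9

# A population-level finite mixture of the two profiles gives a different value.
mixture = 0.5 * (0.9 * 0.9) + 0.5 * (0.2 * 0.2)
```

The gap between `half` and `mixture` reflects that the mixing happens separately for each variable within an individual rather than once for the whole response vector.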

This conditional independence assumption also contains the assumption that the profile distributions
are themselves factorable. If an individual belongs exclusively to profile $k$ (for example, an
image contains only water), then $\theta_{ik} = 1$, and all other elements in the vector $\theta_i$ are zero. Thus,

$$F(x \mid \theta_{ik} = 1) = \prod_{j} F_{kj}(x_j) = F_k(x). \qquad (3.5)$$

The sampling scheme level includes the assumptions about the observed replications. Step 2(a)
of the generative process assumes that replications are independent given the membership vector $\theta_i$.
Thus the individual response distribution becomes:

$$F(x \mid \theta_i) = \prod_{j=1}^{J} \prod_{r=1}^{R_{ij}} \left[ \sum_{k=1}^{K} \theta_{ik} F_{kj}(x_{jr}) \right]. \qquad (3.6)$$


Note that Equations (3.3), (3.4), and (3.6) vary for each individual with the value of $\theta_i$. It is in
this sense that MMM is an individual-level mixture model. The distribution of variables for each
profile, the $F_{kj}$, is fixed at the population level, so that the components of the mixture are the same,
but the proportions of the mixture change individually with the membership parameter $\theta_i$.

The latent variable level corresponds to Step 1 of the generative process. We can treat the
membership vector $\theta$ as either fixed or random. If we wish to treat $\theta$ as random, then we can integrate
Equation (3.6) over the distribution of $\theta$, yielding:

$$F(x) = \int \prod_{j=1}^{J} \prod_{r=1}^{R_{ij}} \left[ \sum_{k=1}^{K} \theta_{ik} F_{kj}(x_{jr}) \right] dD(\theta). \qquad (3.7)$$
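
When the integral has no convenient closed form, it can be approximated by averaging the conditional likelihood over draws of $\theta$. The sketch below does this for categorical data under an assumed Dirichlet distribution for $\theta$; the function name, the data layout, and the numerical values are hypothetical.

```python
import random


def integrated_likelihood(x, alpha, profiles, n_draws, rng):
    """Monte Carlo estimate of the likelihood integrated over Dirichlet(alpha).

    x[j] lists the observed categories for the replications of variable j;
    profiles[k][j][c] is the probability of category c under profile k.
    """
    total = 0.0
    for _ in range(n_draws):
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        norm = sum(gammas)
        theta = [g / norm for g in gammas]            # one draw of theta
        lik = 1.0
        for j, reps in enumerate(x):
            for xjr in reps:
                lik *= sum(t * profiles[k][j][xjr]
                           for k, t in enumerate(theta))
        total += lik
    return total / n_draws


profiles = [[[0.1, 0.9]], [[0.8, 0.2]]]               # K = 2 profiles, J = 1
alpha = [1.0, 1.0]
# The same seed is reused so both estimates average over identical theta draws.
p1 = integrated_likelihood([[1]], alpha, profiles, 2000, random.Random(0))
p0 = integrated_likelihood([[0]], alpha, profiles, 2000, random.Random(0))
```

With a symmetric Dirichlet the expected membership in each profile is 0.5, so `p1` should be close to 0.5 × 0.9 + 0.5 × 0.2 = 0.55.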


The final layer of assumptions about the latent variable $\theta$ is crucial for purposes of estimation, but
it is unimportant for the discussion of mixed membership model properties in this chapter. All of 
the results presented here flow from the exchangeability assumption in Equation (3.4), and hold 
whether we use Equation (3.6) or (3.7) for estimation. 


3.3 The Development of Mixed Membership 

Independently, Woodbury et al. (1978), Pritchard et al. (2000), and Blei et al. (2003) developed 
remarkably similar mixed membership models to solve problems in three very different content 
areas. 

Grade of Membership Model 

The Grade of Membership model (GoM) is by far the earliest example of mixed membership (Wood- 
bury et al., 1978). The motivation for creating this model came from the problem of designing a 
system to help doctors diagnose patients. The problems with creating such a system are numerous: 
Patients may not have all of the classic symptoms of a disease, they may have multiple diseases, 
relevant information may be missing from a patient’s profile, and many diseases have similar symp- 
toms. 

In this setting, the mixed membership profiles represent distinct diseases. The observed data $X_{ij}$
are categorical levels of indicator $j$ for patient $i$. The profile distributions $F_{kj}(x_j)$ indicate which
level of indicator $j$ is likely to be present in disease $k$. Since $X_{ij}$ is categorical, and there is only one
measurement of an indicator for each patient, the profile distributions are multinomial with $n = 1$.
In this application, the individual’s disease profile is the object of inference, so that the likelihood
in Equation (3.4) is used for estimation.

Population Admixture Model

Pritchard et al. (2000) models the genotypes of individuals in a heterogeneous population. The
profiles represent distinct populations of origin from which individuals in the current population 
have inherited their genetic makeup. 

The variables $X_j$ are the genotypes observed at $J$ locations, and for diploid individuals two
replications are observed at each location ($R_j = 2$). Across a population, a finite number of distinct
alleles are observed at each location $j$, so that $X_j$ is categorical and $F_{kj}$ is multinomial for each
sub-population $k$.

In this application, the distribution of the membership parameters $\theta_i$ is of as much interest as
the parameters themselves. The parameters $\theta_i$ are treated as random realizations from a symmetric
Dirichlet distribution. It is important to note that a symmetric Dirichlet distribution will result in an
identifiability problem that is not present when $\theta$ has an asymmetric distribution (Galyardt, 2012).

One interesting feature of the admixture model is that it includes the possibility of both unsu- 
pervised and supervised learning. Most mixed membership models are estimated as unsupervised 
models. That is, the models are estimated with no information about what the profiles may be and 
no information about which individuals may have some membership in the same profiles. Pritchard 
et al. (2000) considers the unsupervised case, but also considers the case where there is additional 
information. In this application, the location where an individual bird was captured means that it is
likely a descendant of a certain population with a lower probability that it descended from an immigrant.
This information is included with a carefully constructed prior on $\theta$, which also incorporates
rates of migration.

Latent Dirichlet Allocation 

Latent Dirichlet allocation (Blei et al., 2003) is in some ways the simplest example of mixed mem- 
bership, as well as the most popular. LDA is a textual analysis model, where the goal is to identify 
the topics present in a corpus of documents. Mixed membership is necessary because many docu- 
ments are about more than one topic. 

LDA uses a “bag-of-words” model, where only the presence or absence of words in a document
is modeled and word order is ignored. The individuals $i$ are the documents. The profiles $k$ represent
the topics present in the corpus. LDA models only one variable, the words present in the documents
($J = 1$). The number of replications $R_{ij}$ is simply the number of words in document $i$. The profile
distributions are multinomial distributions over the set of words: $F_{kj} = \text{Multinomial}(\lambda_k, n = 1)$,
where $\lambda_{kw}$ is the probability of word $w$ appearing in topic $k$. LDA uses the integrated likelihood in
Equation (3.7). The focus here is on estimating the topic profiles, and the distribution of membership
parameters, rather than the $\theta_i$ themselves. LDA also uses a Dirichlet distribution for $\theta$; however, it
does not use a symmetric Dirichlet, and so it avoids the identifiability issues that are present in the
admixture model (Galyardt, 2012).

3.3.1 Variations of Mixed Membership Models 

Variations of mixed membership models fall into two broad groups: the first group alters the distribution
of the membership parameter $\theta$; the second group alters the profile distributions $F_{kj}$.

Membership Parameters 

The membership vector $\theta$ is non-negative and sums to 1, so that it lies within a $(K - 1)$-dimensional
simplex. The two most popular distributions on the simplex are the Dirichlet and the logistic-normal.

Both LDA and the population admixture model use a Dirichlet distribution as the prior for the
membership parameter. This is the obvious choice when the data are categorical, since the Dirichlet
distribution is a conjugate prior for the multinomial. However, the Dirichlet distribution introduces
a strong independence condition on the components of $\theta$, subject to the constraint $\sum_k \theta_{ik} = 1$
(Aitchison, 1982).

In many applications, this strong independence assumption is a problem. For example, an
article with partial membership in an evolution topic is more likely to also be about genetics than
astronomy. In order to model an interdependence between profiles, Blei and Lafferty (2007) uses a
logistic-normal distribution for $\theta$. Blei and Lafferty (2006) takes this idea a step further and creates
a dynamic model where the mean of the logistic-normal distribution evolves over time.
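The contrast between the two priors can be sketched numerically. The snippet below is only an illustration under stated assumptions: the Dirichlet parameters, the shared-factor construction of the logistic-normal, and all function names are hypothetical choices made here, not taken from the cited papers. It draws membership vectors with $K = 3$ from each family and compares the sample covariance of the first two components.

```python
import math
import random

def dirichlet(alpha, rng):
    """Draw one Dirichlet vector by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def logistic_normal(rng, shared_sd=1.5, own_sd=0.1):
    """Draw one logistic-normal vector with K = 3 (illustrative construction).

    Components 1 and 2 share a Gaussian factor, so their log-odds relative to
    the reference component 3 (fixed at 0) are positively correlated.
    """
    z0 = rng.gauss(0.0, shared_sd)
    eta = [z0 + rng.gauss(0.0, own_sd), z0 + rng.gauss(0.0, own_sd), 0.0]
    w = [math.exp(e) for e in eta]
    total = sum(w)
    return [x / total for x in w]

def covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)
dir_draws = [dirichlet([1.0, 1.0, 1.0], rng) for _ in range(5000)]
ln_draws = [logistic_normal(rng) for _ in range(5000)]

cov_dirichlet = covariance([d[0] for d in dir_draws], [d[1] for d in dir_draws])
cov_logistic_normal = covariance([d[0] for d in ln_draws], [d[1] for d in ln_draws])
print("Dirichlet cov:", cov_dirichlet)
print("logistic-normal cov:", cov_logistic_normal)
```

Under a Dirichlet, any two components are negatively correlated because of the sum-to-one constraint, whereas the logistic-normal can encode a positive association such as the evolution and genetics topics mentioned above.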

Fei-Fei and Perona (2005) analyzes images, where the images contain different proportions of 
the profiles water, sky, foliage, etc. However, images taken in different locations will have a different 
underlying distribution for the mixtures of each of these profiles. For example, rural scenes will 
have more foliage and fewer buildings than city scenes. Fei-Fei and Perona (2005) addresses this by 
giving the membership parameters a distribution that is a mixture of Dirichlets. 

Profiles 

In all three of the original models, the data are categorical and the profile distributions $F_{kj}$ are
multinomial. More recently, we have seen a variety of mixed membership models for data that are
not categorical, with different parametric families for the $F_{kj}$ distributions.

Latent process decomposition (Rogers et al., 2005) describes the different processes that might
be responsible for different levels of gene expression observed in microarray datasets. In this
application, $X_{ij}$ measures the expression level of the $j$th gene in sample $i$, a continuous quantity. This
leads to profile distributions $F_{kj} = N(\mu_{kj}, \sigma_{kj})$.

The simplicial mixture of Markov chains (Girolami and Kaban, 2005) is a mixed membership
model where each profile is characterized by a Markov chain transition matrix. The idea is that over 
time an individual may engage in different activities, and each activity is characterized by a probable 
sequence of actions. 

The mixed membership naive Bayes model (Shan and Banerjee, 2011) is another extension
of LDA, which seeks to define a ‘generalization’ of LDA. This model simply requires the profile
distributions $F_{kj}$ to be exponential family distributions. This is a subset of models that falls within
Erosheva’s general mixed membership model (Erosheva et al., 2004). Moreover, other exponential
family profile distributions will not have the same properties as the multinomial profiles used in
LDA (Galyardt, 2012). The main contribution of Shan and Banerjee (2011) is a comparison of
different variational estimation methods for particular choices of $F_{kj}$.


3.4 The Finite Mixture Model Representation 

Before we discuss the relationship between mixed membership models (MMM) and finite mixture 
models (FMM), we will briefly review FMM. 

3.4.1 Finite Mixture Models 

Finite mixture models (FMM) go by many different names, such as “latent class models” or simply 
“mixture models,” and they are used in many different applications from psychometrics to clustering 
and classification. 

The basic assumption is that within the population there are different subgroups, $s = 1, \ldots, S$,
which may be called clusters or classes depending on the application. Each subgroup has its own
distribution of data, $F_s(x)$, and each subgroup makes up a certain proportion of the population, $\pi_s$.
The distribution of data across the population is then given by:

$$F(x) = \sum_{s=1}^{S} \pi_s F_s(x). \tag{3.8}$$

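For reference, the distribution of data over the population in a MMM, given by Equation (3.7), is:

$$F(x) = \int \prod_{j=1}^{J} \prod_{r=1}^{R_j} \left[ \sum_{k=1}^{K} \theta_k F_{kj}(x_{jr}) \right] dD(\theta). \tag{3.9}$$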

Finite mixture models can be considered a special case of mixed membership models. In a mixed
membership model, the membership vector $\theta_i$ indicates how much individual $i$ belongs to each of
the profiles; thus $\theta_i$ lies in a $(K - 1)$-dimensional simplex. If the distribution of the membership
parameter $\theta$ is restricted to the corners of the simplex, then $\theta_i$ will be an indicator vector and Equation
(3.9) will reduce to the form of Equation (3.8). So a finite mixture model is a special case of mixed
membership with a particular distribution of $\theta$.
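The reduction can be written out as a short worked derivation. It is a sketch that assumes the population distribution of a MMM integrates a product over variables and replications of profile mixtures, and that the membership distribution $D$ places mass $\pi_s$ on the corner $e_s$ of the simplex (the indicator vector for profile $s$); the symbol $e_s$ is notation introduced here for illustration.

```latex
\begin{align*}
F(x \mid \theta = e_s)
  &= \prod_{j=1}^{J} \prod_{r=1}^{R_j} \sum_{k=1}^{K} [e_s]_k \, F_{kj}(x_{jr})
   = \prod_{j=1}^{J} \prod_{r=1}^{R_j} F_{sj}(x_{jr})
   = F_s(x), \\
F(x)
  &= \sum_{s=1}^{K} \Pr(\theta = e_s) \, F(x \mid \theta = e_s)
   = \sum_{s=1}^{K} \pi_s F_s(x).
\end{align*}
```

The integral over the simplex collapses to a finite sum over its $K$ corners, which is exactly the finite mixture form with $S = K$ classes.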

3.4.2 Erosheva’s Representation Theorem 

Even though FMM is a special case of MMM, every MMM can be expressed in the form of an 
FMM with a potentially much larger number of classes. Haberman (1995) suggests this relationship 
in his review of Manton et al. (1994). Erosheva et al. (2007) shows that it holds for categorical data 
and indicates that the same result holds in the general case as well. Here the theorem is presented in 
a general form. 

Before we consider the formal version of the theorem, we can build some intuition based on the
generative version of MMM. In the generative process, to generate the data point $X_{ijr}$ for individual
$i$’s replication $r$ of variable $j$, we first draw an indicator variable $Z_{ijr} \sim \text{Multinomial}(\theta_i)$ that
indicates which profile $X_{ijr}$ will be drawn from. Let us write $Z_{ijr}$ in the form: $Z_{ijr} = k$ if $X_{ijr}$ was
drawn from profile $k$. Effectively, $Z$ indicates that individual $i$ ‘belongs’ to profile $k$ for observation
$jr$.

The set of all possible combinations of $Z$ defines a set of FMM classes, which we shall write as
$\mathcal{Z} = \{1, \ldots, K\}^R$, where $R$ is the total number of replications of all variables. For individual $i$, let
$\zeta_i = (Z_{i11}, \ldots, Z_{iJR_J}) \in \mathcal{Z}$. So $\zeta_i$ indicates which profile an individual belongs to for each and
every observed variable.

Representation Theorem. Assume a mixed membership model with $J$ features and $K$ profiles. To
account for any replications in features, assume that each feature $j$ has $R_j$ replications, and let
$R = \sum_{j=1}^{J} R_j$. Write the profile distributions as

$$F_k(x) = \prod_{r=1}^{R} F_{kr}(x_r).$$

Then the mixed membership model can be represented as a finite mixture model with components
indexed by $\zeta \in \{1, \ldots, K\}^R = \mathcal{Z}$, where the classes are

$$F_\zeta^{FMM}(x) = \prod_{r=1}^{R} F_{\zeta_r, r}(x_r) \tag{3.10}$$

and the probability associated with each class $\zeta$ is

$$\pi_\zeta = E\left[ \prod_{r=1}^{R} \theta_{\zeta_r} \right]. \tag{3.11}$$

Proof. Begin with the individual mixed membership distribution, conditional on $\theta_i$:

\begin{align}
F(x \mid \theta_i) &= \prod_r \sum_k \theta_{ik} F_{kr}(x_r), \tag{3.12} \\
&= \sum_{\zeta \in \mathcal{Z}} \prod_r \theta_{i\zeta_r} F_{\zeta_r, r}(x_r). \tag{3.13}
\end{align}

Equation (3.13) reindexes the terms of the finite sum when Equation (3.12) is expanded. Distributing
the product over $r$ yields Equation (3.14):

\begin{align}
F(x \mid \theta_i) &= \sum_{\zeta \in \mathcal{Z}} \left[ \prod_r \theta_{i\zeta_r} \right] \left[ \prod_r F_{\zeta_r, r}(x_r) \right] \tag{3.14} \\
&= \sum_{\zeta \in \mathcal{Z}} \pi_{i\zeta} F_\zeta(x). \tag{3.15}
\end{align}

Integrating Equation (3.15) yields the form of a finite mixture model:

$$F(x) = E_\theta\left[ \sum_{\zeta \in \mathcal{Z}} \pi_{i\zeta} F_\zeta(x) \right] = \sum_{\zeta \in \mathcal{Z}} \pi_\zeta F_\zeta(x). \tag{3.16}$$

□

Erosheva’s representation theorem states that if a mixed membership model needs $K$ profiles to
express the diversity in the population, an equivalent finite mixture model will require $K^R$
components. In addition, if we compare Equation (3.15) to Equation (3.16), then we see that each
individual’s distribution is also a finite mixture model, with the same components as the population FMM
but with individual mixture proportions.
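The reindexing step in the proof can be checked numerically for a small case. The following is a minimal sketch, assuming $K = 2$ profiles and $R = 3$ replications; the membership vector and the table of density values are made-up numbers standing in for the profile densities evaluated at one fixed observation.

```python
import itertools
import math

K, R = 2, 3
theta = [0.3, 0.7]  # hypothetical membership vector for one individual

# Hypothetical profile densities at a fixed observation x = (x_1, ..., x_R):
# f[k][r] stands for the density of profile k for replication r, evaluated at x_r.
f = [[0.5, 1.2, 0.8],
     [2.0, 0.1, 0.4]]

# Mixed membership form: product over r of the sum over k.
mmm = math.prod(sum(theta[k] * f[k][r] for k in range(K)) for r in range(R))

# Finite mixture form: sum over all class labels zeta in {0, ..., K-1}^R.
weights = {}
fmm = 0.0
for zeta in itertools.product(range(K), repeat=R):
    pi_zeta = math.prod(theta[k] for k in zeta)            # individual class weight
    f_zeta = math.prod(f[zeta[r]][r] for r in range(R))    # class density at x
    weights[zeta] = pi_zeta
    fmm += pi_zeta * f_zeta

print(len(weights), mmm, fmm)
```

The two forms agree up to floating-point error, the number of classes is $K^R = 8$, and the individual class weights sum to one, so they are valid mixture proportions.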

The mixed membership model is a much more efficient representation for high-dimensional
data: we need only $K$ profiles instead of $K^R$ components. However, there is a tradeoff in the
constraints on the shape of the data distribution (Galyardt, 2012). The rest of this chapter will explore
some of these constraints.


3.5 A Simple Example 

A finite mixture model is described by the components of the mixture and the proportion
associated with each component. The representation theorem tells us that when a MMM is expressed in
FMM form, the components are completely determined by the MMM profiles (Equation 3.10), and that
the proportions are completely determined by the distribution of the membership vector $\theta$ (Equation
3.11).

We can think of the MMM profiles $F_{kj}$ as forming a basis for the FMM components $F_\zeta$.
Consider a very simple example with two dimensions ($J = 2$) and two profiles ($K = 2$). Suppose that
the first profile has a uniform distribution on the unit square and the second profile has a
concentrated normal distribution centered at (0.3, 0.7):

\begin{align}
F_1(x) &= F_{11}(x_1) \times F_{12}(x_2) = \text{Unif}(0, 1) \times \text{Unif}(0, 1), \tag{3.17} \\
F_2(x) &= F_{21}(x_1) \times F_{22}(x_2) = N(0.3, 0.1) \times N(0.7, 0.1). \tag{3.18}
\end{align}

From a generative perspective, an individual with membership vector $\theta_i = (\theta_{i1}, \theta_{i2})$ will have
$Z_{i1} = 1$ with probability $\theta_{i1}$ and $Z_{i1} = 2$ with probability $\theta_{i2}$, so that $X_{i1} \sim \text{Unif}(0, 1)$ with
probability $\theta_{i1}$, and $X_{i1} \sim N(0.3, 0.1)$ with probability $\theta_{i2}$. Similarly, for variable $j = 2$, with
probability $\theta_{i1}$, $Z_{i2} = 1$, and with probability $\theta_{i2}$, $Z_{i2} = 2$. In total, there are $K^J = 4$ possible
combinations of $\zeta_i = (Z_{i1}, Z_{i2})$:

\begin{align}
X_i \mid \zeta_i = (1, 1) &\sim \text{Unif}(0, 1) \times \text{Unif}(0, 1), \tag{3.19} \\
X_i \mid \zeta_i = (1, 2) &\sim \text{Unif}(0, 1) \times N(0.7, 0.1), \tag{3.20} \\
X_i \mid \zeta_i = (2, 1) &\sim N(0.3, 0.1) \times \text{Unif}(0, 1), \tag{3.21} \\
X_i \mid \zeta_i = (2, 2) &\sim N(0.3, 0.1) \times N(0.7, 0.1). \tag{3.22}
\end{align}


Equations (3.19)-(3.22) are the four FMM components for this MMM (Figure 3.1),
and they are formed from all the possible combinations of the MMM profiles $F_{kj}$. It is in this sense
that the MMM profiles form a basis for the data distribution.

The membership parameter $\theta_i$ governs how much individual $i$ ‘belongs’ to each of the MMM
profiles. If $\theta_{i1} > \theta_{i2}$, then $\zeta_i = (1, 1)$ is more likely than $\zeta_i = (2, 2)$. Notice, however, that since
multiplication is commutative, $\theta_{i1}\theta_{i2} = \theta_{i2}\theta_{i1}$, so that $\zeta_i = (1, 2)$ always has the same probability
as $\zeta_i = (2, 1)$.

Figure 3.2 shows the data distribution of this MMM for two different distributions of $\theta$. The
change in the distribution of $\theta$ affects only the probability associated with each component. Thus
the MMM profiles define the modes of the data, and the distribution of $\theta$ controls the height of the
modes.
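To make the role of the membership distribution concrete, the component proportions $\pi_\zeta = E[\theta_{\zeta_1}\theta_{\zeta_2}]$ can be computed in closed form when $\theta$ is Dirichlet. This is a sketch under an assumption: the two Dirichlet parameter vectors below are illustrative choices and are not the distributions used to draw Figure 3.2. It relies on the standard Dirichlet second moments, $E[\theta_a\theta_b] = \alpha_a\alpha_b / (\alpha_0(\alpha_0 + 1))$ for $a \neq b$ and $E[\theta_a^2] = \alpha_a(\alpha_a + 1) / (\alpha_0(\alpha_0 + 1))$, with $\alpha_0 = \sum_k \alpha_k$.

```python
def component_weights(alpha):
    """FMM proportions pi_zeta = E[theta_a * theta_b] for K = 2, J = 2,
    assuming theta ~ Dirichlet(alpha). Keys are the class labels (a, b)."""
    a0 = sum(alpha)
    denom = a0 * (a0 + 1.0)
    weights = {}
    for a in range(2):
        for b in range(2):
            if a == b:
                num = alpha[a] * (alpha[a] + 1.0)
            else:
                num = alpha[a] * alpha[b]
            weights[(a + 1, b + 1)] = num / denom
    return weights

w_symmetric = component_weights([1.0, 1.0])  # hypothetical symmetric choice
w_skewed = component_weights([4.0, 1.0])     # hypothetical choice favoring profile 1

for label in sorted(w_symmetric):
    print(label, round(w_symmetric[label], 4), round(w_skewed[label], 4))
```

In both cases the four components are unchanged; only their weights move. The weights for $(1, 2)$ and $(2, 1)$ coincide, as the commutativity argument above requires.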

Alternate Profiles 

Consider an alternate set of MMM profiles, $G$:

\begin{align}
G_1(x) &= \text{Unif}(0, 1) \times N(0.7, 0.1), \tag{3.23} \\
G_2(x) &= N(0.3, 0.1) \times \text{Unif}(0, 1). \tag{3.24}
\end{align}

The $G$ profiles are essentially a rearrangement of the $F$ profiles, and will generate exactly the
same FMM components as the $F$ profiles (Figure 3.3). For any MMM, there are $(K!)^{J-1}$ sets
of basis profiles which will generate the same set of components in the FMM representation
(Galyardt, 2012). The observation that multiple sets of MMM basis profiles can generate the same FMM
components has implications for the identifiability of MMM, which is explored fully in Galyardt
(2012).
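The equivalence of the two profile sets can be verified by enumeration. The sketch below represents each marginal distribution by a text label (a simplification made here for illustration) and builds the FMM components for both sets of profiles.

```python
from itertools import product

# Each profile is a tuple of marginal distributions, one per variable (J = 2).
F = {1: ("Unif(0,1)", "Unif(0,1)"), 2: ("N(0.3,0.1)", "N(0.7,0.1)")}
G = {1: ("Unif(0,1)", "N(0.7,0.1)"), 2: ("N(0.3,0.1)", "Unif(0,1)")}

def fmm_components(profiles, n_vars=2):
    """Map each class label zeta to its component: variable j takes its
    marginal from profile zeta[j]."""
    return {
        zeta: tuple(profiles[zeta[j]][j] for j in range(n_vars))
        for zeta in product(sorted(profiles), repeat=n_vars)
    }

components_F = fmm_components(F)
components_G = fmm_components(G)

for zeta in sorted(components_F):
    print(zeta, components_F[zeta], components_G[zeta])
```

The two dictionaries contain the same four components under different labels; for instance, the component labeled $(1, 1)$ under $G$ is the one labeled $(1, 2)$ under $F$.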

Multivariate $X_j$

The same results hold when $X_j$ is multivariate. Consider an example where each profile $F_{kj}$ is a
multivariate Gaussian, as used in the GM-LDA model in Blei and Jordan (2003). Then we can write
the profiles as:

\begin{align*}
F_1(x) &= F_{11}(x_1) \times F_{12}(x_2) = MvN(\mu_{11}, \Sigma_{11}) \times MvN(\mu_{12}, \Sigma_{12}), \\
F_2(x) &= F_{21}(x_1) \times F_{22}(x_2) = MvN(\mu_{21}, \Sigma_{21}) \times MvN(\mu_{22}, \Sigma_{22}).
\end{align*}

The corresponding FMM components are then:

\begin{align}
X_i \mid \zeta_i = (1, 1) &\sim MvN(\mu_{11}, \Sigma_{11}) \times MvN(\mu_{12}, \Sigma_{12}), \tag{3.25} \\
X_i \mid \zeta_i = (1, 2) &\sim MvN(\mu_{11}, \Sigma_{11}) \times MvN(\mu_{22}, \Sigma_{22}), \tag{3.26} \\
X_i \mid \zeta_i = (2, 1) &\sim MvN(\mu_{21}, \Sigma_{21}) \times MvN(\mu_{12}, \Sigma_{12}), \tag{3.27} \\
X_i \mid \zeta_i = (2, 2) &\sim MvN(\mu_{21}, \Sigma_{21}) \times MvN(\mu_{22}, \Sigma_{22}). \tag{3.28}
\end{align}

There are still $K^R$ FMM components; the only difference is that these clusters are not in an
$R$-dimensional space but a higher-dimensional space, depending on the dimensionality of the $X_j$.

FIGURE 3.1

Each of the four boxes shows the contour plot of an FMM component in Equations (3.19)-(3.22).
They correspond to the MMM defined by the F profiles in Equations (3.17)-(3.18). $X_1$ and $X_2$ are
the two observed variables. Lighter contour lines indicate higher density.

FIGURE 3.2

Contour plot of the MMM defined by the profiles in Equations (3.17)-(3.18) with two different
distributions of $\theta$. $X_1$ and $X_2$ are the two observed variables. Lighter contour lines indicate higher
density; the scale is the same for both figures.

FIGURE 3.3

Each of the four boxes shows the contour plot of an FMM component corresponding to the MMM
defined by the G profiles in Equations (3.23)-(3.24). Note that these are the same components as
those defined by the F profiles in Figure 3.1 and Equations (3.19)-(3.22), simply re-indexed. $X_1$
and $X_2$ are the two observed variables. Lighter contour lines indicate higher density.


3.6 Categorical vs. Continuous Data 

All three of the original mixed membership models, and a majority of the subsequent variations, 
were built for categorical data. This focus on categorical data can lead to intuitions about mixed 
membership models which do not hold in the general case. Since every mixed membership model 
can be expressed as a finite mixture model, the best way to understand the difference between 
continuous and categorical data in MMM is to focus on how different data types behave in FMM. 

Let us begin by considering the individual distributions conditional on profile membership
(Equation 3.3):

$$F(x_j \mid \theta_i) = \sum_{k=1}^{K} \theta_{ik} F_{kj}(x_j).$$

In general, this equation does not simplify, but in the case of categorical data, it does. This is the 
key difference between categorical data and any other type of data. 

If variable $X_j$ is categorical, then we can represent the possible values for this variable as
$l_1, \ldots, l_{L_j}$. We represent the distribution for each profile as $F_{kj}(x_j) = \text{Multinomial}(\lambda_{kj}, n = 1)$,
where $\lambda_{kj}$ is the probability vector for profile $k$ on feature $j$, and $n$ is the number of multinomial
trials. The probability of observing a particular value $l$ within basis profile $k$ is written as:

$$\Pr(X_j = l \mid \theta_k = 1) = \lambda_{kjl}. \tag{3.29}$$

The probability of individual $i$ with membership vector $\theta_i$ having value $l$ for feature $j$ is then

$$\Pr(X_{ij} = l \mid \theta_i) = \sum_{k=1}^{K} \theta_{ik} \Pr(X_j = l \mid \theta_k = 1) = \sum_{k=1}^{K} \theta_{ik} \lambda_{kjl}. \tag{3.30}$$

Consider LDA as an example. Assume that document $i$ belongs only to the sports and medicine
topics. The two topics each have a different probability distribution over the lexicon of words, say
$\text{Multinomial}(\lambda_s)$ and $\text{Multinomial}(\lambda_m)$. The word elbow has a different probability of
appearing in each topic, $\lambda_{s,e}$ and $\lambda_{m,e}$, respectively. Then the probability of the word elbow appearing in
document $i$ is given by $\lambda_{i,e} = \theta_{is}\lambda_{s,e} + \theta_{im}\lambda_{m,e}$. Since the vector $\theta_i$ sums to 1, the individual
probability $\lambda_{i,e}$ must be between $\lambda_{s,e}$ and $\lambda_{m,e}$. The individual probability is between the probabilities in
the two profiles.

We can simplify the mathematics further if we collect the $\lambda_{kj}$ into a matrix by rows and call this
matrix $\lambda_j$. Then $\theta_i^T \lambda_j$ is a vector of length $L_j$ where the $l$th entry is individual $i$’s probability of
value $l$ on feature $j$, as in Equation (3.30).

We can now write individual $i$’s probability vector for feature $j$ as

$$\lambda_{ij} = \theta_i^T \lambda_j. \tag{3.31}$$

The matrix $\lambda_j$ defines a linear transformation from $\theta_i$ to $\lambda_{ij}$, as illustrated in Figure 3.4. Since $\theta_i$
is a probability vector and sums to 1, $\lambda_{ij}$ is a convex combination of the profile probability
vectors $\lambda_{kj}$. Thus the individual $\lambda_{ij}$ lies within a simplex where the extreme points are the $\lambda_{kj}$. In
other words, the individual response probabilities lie between the profile probabilities. This leads
Erosheva et al. (2004) and others to refer to the profiles as “extreme profiles.” For categorical data,
the parameters of the profiles form the extremes of the individual parameter space.

Moreover, since the mapping from the individual membership parameters $\theta_i$ to the individual
feature probabilities $\lambda_{ij}$ is linear, the distribution of individual response probabilities is effectively
the same as the population distribution of membership parameters (Figure 3.4).

Thus, when feature $X_j$ is categorical, an individual with membership vector $\theta_i$ has a probability
distribution of

$$F(x_j \mid \theta_i) = \text{Multinomial}(\theta_i^T \lambda_j, n = 1). \tag{3.32}$$

This is the property that makes categorical data special. When the profile distributions are
multinomial with $n = 1$, the individual-level mixture distributions are also multinomial with $n = 1$.
Moreover, we also have that the parameters of the individual distributions, the $\theta_i^T \lambda_j$, are convex
combinations of the profile parameters, the $\lambda_{kj}$. In this sense, when the data are categorical, an
individual with mixed membership in multiple profiles is effectively between those profiles.
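A small numerical instance shows the convex combination at work. The membership vector and the profile probability matrix below are made-up values for a single feature with three categories; they are assumptions for illustration only.

```python
theta_i = [0.25, 0.75]      # hypothetical membership vector, K = 2

# Hypothetical profile probability matrix for one feature:
# row k holds profile k's probabilities over L_j = 3 categories.
lam_j = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6]]

n_profiles = len(lam_j)
n_categories = len(lam_j[0])

# Individual probability vector: the vector-matrix product theta_i^T lam_j.
lam_ij = [
    sum(theta_i[k] * lam_j[k][l] for k in range(n_profiles))
    for l in range(n_categories)
]
print(lam_ij)  # approximately [0.25, 0.275, 0.475]
```

Each entry of the individual vector lies between the corresponding entries of the two profile vectors, and the entries still sum to one, so the individual distribution is again multinomial with $n = 1$.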

In general, this between relationship does not hold. The general interpretation is a switching
interpretation, and is clearly captured by the indicator variable $Z_{ijr}$ in the generative model. $Z_{ijr}$
indicates which profile distribution $k$ generated the observation $X_{ijr}$. Thus, $Z$ indicates that an
individual switched from profile $k$ for the $j$th variable to profile $k'$ for the $(j + 1)$st variable.

The between interpretation for categorical data only holds in the multinomial parameter space:
$\lambda_i$ is between the profile parameters $\lambda_k$. The behavior in data space is the same switching behavior
as defined in the general case. Individuals may only give responses that are within the support of at
least one of the profiles.

Consider LDA as an example. The observation $X_{ir}$ is the $r$th word appearing in document $i$;
each profile is a multinomial probability distribution over the set of words. “Camel” may be a high
probability word in the zoo topic, while “cargo” has high probability in the transportation topic.
For a document with partial membership in the zoo and transportation topics, the word camel will
have a probability of appearing that is between the probability of camel in the zoo topic and its
probability in the transportation topic. Similarly for the word cargo. However, it doesn’t make sense
to talk about the word “cantaloupe” being between camel and cargo. With categorical data, there is
no ‘between’ in the data space. The between interpretation only holds in the parameter space.

FIGURE 3.4

The membership parameter $\theta_i$ lies in a $K - 1$ simplex. When the mixed membership profiles are
$F_{kj} = \text{Multinomial}(\lambda_{kj}, n = 1)$, the membership parameters are mapped linearly onto response
probabilities (Equation 3.31), indicated by the arrow. The density, indicated by the shading, is
preserved by the linear mapping. This mapping allows us to interpret individual $i$’s position in the
$\theta$-simplex as equivalent to their response probability vector.

Consider another example: suppose that we are looking at response times for a student taking
an assessment, where $X_{ij}$ is the response time of student $i$ on item $j$ and each profile represents a
particular strategy. Suppose that one strategy results in a response time with a distribution $N(10, 1)$
and another less effective strategy has a response time distribution of $N(20, 2)$. In the mixed
membership model, an individual with membership vector $\theta_i = (\theta_{i1}, \theta_{i2})$ then has a response time
distribution of $\theta_{i1} N(10, 1) + \theta_{i2} N(20, 2)$. This individual may use strategy 1 or strategy 2, but a
response time of 15 has a low probability under both strategies and in the mixture. The individual may
switch between using strategy 1 and strategy 2 on subsequent items, but a response time between
the two distributions is never likely, no matter the value of $\theta$. Moreover, the individual distribution
is no longer normal but a mixture of normals (Titterington et al., 1985). Thus, for this continuous
data, we can use a switching interpretation, but a between interpretation is unavailable.
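The response-time example can be checked by evaluating the mixture density directly. This is a minimal sketch with one explicit assumption: the second argument of each normal distribution is read as a standard deviation, which the text does not specify. The membership values on the grid are likewise illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, theta1):
    """Individual response-time density theta1 * N(10, 1) + (1 - theta1) * N(20, 2).

    Assumption: the second parameter is treated as a standard deviation.
    """
    return theta1 * normal_pdf(x, 10.0, 1.0) + (1.0 - theta1) * normal_pdf(x, 20.0, 2.0)

densities = {}
for theta1 in (0.1, 0.5, 0.9):
    # Density at the strategy-1 mean, at the midpoint 15, and at the strategy-2 mean.
    densities[theta1] = tuple(mixture_pdf(x, theta1) for x in (10.0, 15.0, 20.0))
    print(theta1, densities[theta1])
```

For each of these membership values the density at 15 is well below the density at either strategy mean, even when $\theta_i = (0.5, 0.5)$ and 15 is exactly the mean of the individual's distribution.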


3.6.1 Conditions for a ‘Between’ Interpretation 

The between interpretation arises out of a special property of the multinomial distribution: the indi- 
vidual probability distributions are in the same parametric family as the profile distributions, multi- 
nomial with n = 1, and the individual parameters are between the profile parameters (Equation 3.31 
and Figure 3.4). 

For the between interpretation to be available, this is the property we need to preserve. The
individual distributions $F(x \mid \theta_i)$ must be in the same parametric family as each profile distribution
$F_k$. Additionally, if $F$ is parameterized by $\phi$, then the individual parameters $\phi_i$ must lie between the
profile parameters $\phi_k$.

Thus, the property we are looking for is that an individual with membership parameter $\theta_i$ would
have an individual data distribution of $F(X; \theta_i^T \phi)$, so that for each variable $j$ we would have:

$$X_{ij} \mid \theta_i \sim \sum_k \theta_{ik} F_{kj}(x_j \mid \phi_{kj}) = F_j(x_j; \theta_i^T \phi_{\cdot j}). \tag{3.33}$$

In other words, the between interpretation is only available if the profile cumulative distribution 
functions (cdfs) are linear transformations of their parameters. The only exponential family distri- 
bution with this property is the multinomial distribution with n = 1. Thus, it is the only common 
profile distribution which allows a between interpretation (Galyardt, 2012). 

The partial membership models in Gruhl and Erosheva (2013) and Mohamed et al. (2013) use
a likelihood that is equivalent to Equation (3.33) in the general case. This fundamentally alters
the mixed membership exchangeability assumption for the distribution of $X \mid \theta$, and preserves the
between interpretation in the general case.


Example 

We will focus on a single variable $j$, omitting the subscript $j$ within this example for simplicity.
Let the profile distributions be Gaussian mixture models with proportions $\beta_k = (\beta_{k1}, \ldots, \beta_{kS})$ and
fixed means $c_s$. If we denote the cdf of the standard normal distribution as $\Phi$, then we can write the
profiles as

$$F_k(x) = F(x; \beta_k) = \sum_s \beta_{ks} \Phi(x - c_s). \tag{3.34}$$

Define $\beta_{is} = \theta_i^T (\beta_{1s}, \ldots, \beta_{Ks})$. Then the individual distributions, conditional on the membership
vector $\theta_i$, are

\begin{align}
X \mid \theta_i &\sim \sum_k \theta_{ik} \left[ \sum_s \beta_{ks} \Phi(x - c_s) \right], \tag{3.35} \\
&= \sum_s \beta_{is} \Phi(x - c_s), \tag{3.36} \\
&= F(x; \beta_i). \tag{3.37}
\end{align}

Thus, the individual parameter $\beta_i$ is in between the profile parameters $\beta_k$.
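The collapse of the double sum into a single profile-shaped distribution can be confirmed numerically. The component means, profile proportions, and membership vector below are hypothetical values chosen for illustration, with $K = 2$ profiles and $S = 3$ shared components.

```python
import math

def Phi(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

c = [-2.0, 0.0, 3.0]        # fixed component means c_s (hypothetical)
beta = [[0.7, 0.2, 0.1],    # proportions for profile 1 (hypothetical)
        [0.1, 0.1, 0.8]]    # proportions for profile 2 (hypothetical)
theta_i = [0.4, 0.6]        # membership vector (hypothetical)

def F(x, proportions):
    """Cdf of a Gaussian mixture with unit variances and the fixed means c."""
    return sum(b_s * Phi(x - c_s) for b_s, c_s in zip(proportions, c))

# Individual proportions: theta-weighted average of the profile proportions.
beta_i = [sum(theta_i[k] * beta[k][s] for k in range(2)) for s in range(3)]

xs = [-3.0, -1.0, 0.5, 2.0, 4.0]
mixture_cdf = [sum(theta_i[k] * F(x, beta[k]) for k in range(2)) for x in xs]
blended_cdf = [F(x, beta_i) for x in xs]
print(beta_i)
print(mixture_cdf)
print(blended_cdf)
```

The mixture of the profile cdfs and the single cdf with blended proportions agree at every evaluation point, and each blended proportion lies between the corresponding profile proportions.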

Now let us change the profile distributions slightly. Suppose the means are no longer fixed
constants but are also variable parameters:

$$F^*_k(x) = F^*(x; \beta_k, \mu) = \sum_s \beta_{ks} \Phi(x - \mu_s). \tag{3.38}$$

In this case the individual conditional distributions are given by

\begin{align}
X \mid \theta_i &\sim \sum_k \theta_{ik} \left[ \sum_s \beta_{ks} \Phi(x - \mu_s) \right], \tag{3.39} \\
&= F^*(x; \beta_i, \mu). \tag{3.40}
\end{align}

Figure 3.5 shows three example profiles of this form and the distribution of $X \mid \theta_i$ for two
individuals. Here, the between interpretation does not hold in the entire parameter space. Individual data
distributions are the same form as the profile distributions: both are in the $F^*$ parametric family.
However, $F^*$ has two parameters, $\beta$ and $\mu$. The individual mixing parameter $\beta_i$ will lie in a simplex
defined by the profile parameters $\beta_k$, since $\beta_{is} = \theta_i^T (\beta_{1s}, \ldots, \beta_{Ks})$.

The fact that the individual mixing parameter $\beta_i$ is literally ‘between’ the profile mixing
parameters $\beta_k$ allows us to interpret individuals as a ‘blend’ of the profiles. The same is not true for the $\mu$
parameter. We only have the between interpretation when considering the $\beta$ parameters.

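Now, let’s make another small change to the profile distributions. Suppose that the standard
deviation of the mixture components is not the same for each profile:

$$F^\dagger_k(x) = F^\dagger(x; \beta_k, \mu, \sigma_k) = \sum_s \beta_{ks} \Phi\left( \frac{x - \mu_s}{\sigma_k} \right). \tag{3.41}$$

Now the conditional individual distributions are

\begin{align}
X \mid \theta_i &\sim \sum_k \theta_{ik} F^\dagger_k(x), \tag{3.42} \\
&= \sum_k \theta_{ik} \left[ \sum_s \beta_{ks} \Phi\left( \frac{x - \mu_s}{\sigma_k} \right) \right], \tag{3.43} \\
&= \sum_k \sum_s \theta_{ik} \beta_{ks} \Phi\left( \frac{x - \mu_s}{\sigma_k} \right). \tag{3.44}
\end{align}

Equation (3.44) does not simplify in any way. The conditional individual distribution is no longer
of the $F^\dagger$ form and as such does not have parameters that are between the profile parameters. Figure
3.6 is an analog of Figure 3.5 and shows three $F^\dagger$ profiles and the distribution of $X \mid \theta_i$ for two
individuals.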

This example is analogous to the model of genetic variation, mStruct (Shringarpure, 2012). In 
this model, the population is comprised of K ancestral populations, and each member of the current 
population has mixed membership in these ancestral populations. mStruct also accounts for the fact 
that the current set of alleles may contain mutations from the ancestral set of alleles. 

Each ancestral population has different proportions $\beta_{kj} = (\beta_{kj1}, \ldots, \beta_{kjS})$ of the set of founder
alleles at locus $j$: $\mu_j = (\mu_{j1}, \ldots, \mu_{jS})$. The observed allele for individual $i$ at locus $j$, $X_{ij}$, will
have mutated from the founder alleles according to some probability distribution $P(\cdot \mid \mu_{js}, \delta_{kj})$, with
the mutation rate $\delta_{kj}$ differing depending on the ancestral population. Thus, the profile distributions
are

$$F_{kj}(x_j) = F(x_j; \beta_{kj}, \mu_j, \delta_{kj}) = \sum_{s=1}^{S} \beta_{kjs} P(x_j \mid \mu_{js}, \delta_{kj}). \tag{3.45}$$

The individual probability distribution of alleles at locus $j$, conditional on their membership in
the ancestral profiles, is then given by

$$X_{ij} \mid \theta_i \sim \sum_k \theta_{ik} \left[ \sum_s \beta_{kjs} P(x_j \mid \mu_{js}, \delta_{kj}) \right]. \tag{3.46}$$

In the same way that the conditional individual distributions in the $F^\dagger$ model (Equation 3.44) do
not simplify, the individual distributions in mStruct do not simplify.

FIGURE 3.5

Non-multinomial profile distributions that preserve the ‘between’ interpretation. The top graph
shows three profiles of the form $F^*_k = \sum_s \beta_{ks} \Phi(x - \mu_s)$ (Equation 3.38). The mixture means
$\mu_s$ and the standard deviations are the same for each profile. The lower graph shows two individual
distributions where $X \mid \theta_i \sim F^*(x; \beta_i, \mu)$ (Equation 3.40).

FIGURE 3.6

Profile distributions that do not preserve the ‘between’ interpretation. The top graph shows three
profiles of the form $F^\dagger_k = \sum_s \beta_{ks} \Phi\left( \frac{x - \mu_s}{\sigma_k} \right)$ (Equation 3.41). The mixture means are the same
for each profile, but the standard deviations are different. The lower graph shows two individual
distributions with $X \mid \theta_i \sim \sum_k \sum_s \theta_{ik} \beta_{ks} \Phi\left( \frac{x - \mu_s}{\sigma_k} \right)$ (Equation 3.44).


3.7 Contrasting Mixed Membership Regression Models 

In this section, we compare and contrast two mixed membership models which are identical in the 
exchangeability assumptions and the structure of the models. The only difference is that in one case 
the data is categorical, and in the other case it is continuous. In the categorical case, the between 
interpretation holds and mixed membership is a viable way to model the structure of the data. In the 
continuous case, the between interpretation does not hold and mixed membership cannot describe 
the variation that is present in the data. 

Let us suppose that in addition to the variables $X_{ij}$ we also observe a set of covariates $T_{ij}$. For
example, $T$ may be the date a particular document was published or the age of a participant at the
time of the observation. In this case, we may want the MMM profiles to depend on these covariates:
$F_k(x \mid t)$. There are many ways to incorporate covariates into $F$, but perhaps the most obvious is a
regression model.

Every regression model, whether linear, logistic, or nonparametric, is based on the same
fundamental assumption: $E[X \mid T = t] = m(t)$. When $X$ is binary, $X \mid T = t \sim \text{Bernoulli}(m(t))$.
When $X$ is continuous, we most often use $X \mid T = t \sim N(m(t), \sigma^2)$. In general, we tend not
to treat these two cases as fundamentally different; they are both just regression. The contrast
between these two mixed membership models is inspired by an analysis of the National Long Term
Care Survey (Manrique-Vallier, 2010) and an analysis of children’s numerical magnitude estimation
(Galyardt, 2010; 2012). In Manrique-Vallier (2010), $X$ is binary and $T$ is continuous, so that the
MMM profiles are

$$F_k(x \mid t) = \text{Bernoulli}(m_k(t)). \tag{3.47}$$

In Galyardt (2010), both $X$ and $T$ are continuous, so that the MMM profiles are

$$F_k(x \mid t) = N(m_k(t), \sigma_k^2). \tag{3.48}$$

Note, however, that for the reasons explained here and detailed in Section 3.7.2, a mixed
membership analysis of the numerical magnitude estimation data was wildly unsuccessful (Galyardt, 2010).
An analysis utilizing functional data techniques was much more successful (Galyardt, 2012).

The interesting question is why an MMM was successful in one case and unsuccessful in the
other. At the most fundamental level, the answer is that a mixture of Bernoullis is still Bernoulli,
and a mixture of normals is not normal. This is a straightforward application of Erosheva’s
representation theorem.

To simplify the comparison, let us suppose that we observe a single variable ($J = 1$), with
replications at points $T_r$, $r = 1, \ldots, R$. For example, $X_{ir}$ may be individual $i$’s response to a single
survey item observed at different times $T_{ir}$. To further simplify, we will use only $K = 2$ MMM
profiles with distributions $F(x; m_k(t))$. Thus for an individual with membership parameter $\theta_i$, the
conditional data distribution is:

$$X_i \mid T_i, \theta_i \sim \prod_r \left[ \sum_k \theta_{ik} F(X_{ir}; m_k(T_{ir})) \right]. \tag{3.49}$$

3.7.1 Mixed Membership Logistic Regression 

When the MMM profiles are logistic regression functions (Equation 3.47), then the conditional data
distribution for an individual with membership parameter $\theta_i$ becomes

$$X_i \mid T_i, \theta_i \sim \prod_r \left[ \sum_k \theta_{ik} \text{Bernoulli}(m_k(T_{ir})) \right], \tag{3.50}$$

with

$$m_k(t) = \text{logit}^{-1}(\beta_{0k} + \beta_{1k} t). \tag{3.51}$$

Equation (3.50) is easily rewritten as

$$X_i \mid T_i, \theta_i \sim \prod_r \text{Bernoulli}\left( \sum_k \theta_{ik} m_k(T_{ir}) \right). \tag{3.52}$$

In this case, we can write an individual regression function,

$$m_i(t) = \sum_k \theta_{ik} m_k(t). \tag{3.53}$$

This individual regression function $m_i$ does not have the same loglinear form as $m_k$, so we
cannot talk about individual $\beta$ parameters being between the profile parameters. However, it is a
single smooth regression function that summarizes the individual’s data, and $m_i$ will literally be
between the $m_k$. Figure 3.7 shows an example with two such logistic regression profile functions
and a variety of individual regression functions specified by this mixed membership model.
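The individual regression function can be evaluated on a grid to see the between property directly. This is only a sketch: the profile coefficients, the membership vector, and the grid of covariate values are hypothetical and do not come from the National Long Term Care Survey analysis.

```python
import math

def inv_logit(u):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical (intercept, slope) pairs for K = 2 logistic profile functions.
coef = [(-3.0, 0.08), (1.0, -0.02)]

def m_profile(k, t):
    """Profile regression function m_k(t)."""
    b0, b1 = coef[k]
    return inv_logit(b0 + b1 * t)

def m_individual(theta, t):
    """Individual regression function: theta-weighted average of the profiles."""
    return sum(theta[k] * m_profile(k, t) for k in range(len(coef)))

ts = [0, 20, 40, 60, 80, 100]
theta_i = [0.3, 0.7]
curve = [m_individual(theta_i, t) for t in ts]
lower = [min(m_profile(0, t), m_profile(1, t)) for t in ts]
upper = [max(m_profile(0, t), m_profile(1, t)) for t in ts]

for t, lo, mid, hi in zip(ts, lower, curve, upper):
    print(t, round(lo, 3), round(mid, 3), round(hi, 3))
```

At every covariate value the individual success probability sits between the two profile probabilities, which is what allows a single smooth curve per individual.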


3.7.2 Mixed Membership Regression with Normal Errors 

When the MMM profiles are regression functions with normal errors (Equation 3.48), the
conditional distribution for individual $i$’s data is given by

$$X_i \mid T_i, \theta_i \sim \prod_r \left[ \sum_k \theta_{ik} N(m_k(T_{ir}), \sigma_k^2) \right]. \tag{3.54}$$

Since a mixture of normal distributions is not normal, Equation (3.54) does not simplify. In
this case it is impossible to write a smooth regression function $m_i$. Figure 3.8 demonstrates this
by showing two profile regression functions and contour plots of the density for two individuals,
$X_i \mid T_i, \theta_i$.
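The failure of the between interpretation with normal errors can be illustrated by evaluating the conditional density along a blended curve. The sketch below assumes two hypothetical linear profile functions with a common error standard deviation of 1; none of these values come from the numerical magnitude estimation data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

SIGMA = 1.0  # hypothetical common error standard deviation

def m_profile(k, t):
    """Hypothetical linear profile regression functions m_1(t) and m_2(t)."""
    return 2.0 + 0.5 * t if k == 0 else 10.0 - 0.2 * t

def conditional_pdf(x, t, theta):
    """Individual density of X given T = t: a theta-weighted mixture of normals."""
    return sum(theta[k] * normal_pdf(x, m_profile(k, t), SIGMA) for k in range(2))

theta_i = [0.5, 0.5]
rows = []
for t in (0.0, 2.5, 5.0):
    blend = sum(theta_i[k] * m_profile(k, t) for k in range(2))
    rows.append((
        t,
        conditional_pdf(blend, t, theta_i),            # density on the blended curve
        conditional_pdf(m_profile(0, t), t, theta_i),  # density on profile curve 1
        conditional_pdf(m_profile(1, t), t, theta_i),  # density on profile curve 2
    ))

for row in rows:
    print(row)
```

At each covariate value the density along the blended curve is far lower than along either profile curve: the data concentrate around the two profile functions rather than around any single curve between them.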

It can be tempting to suggest that a change in the distribution of the membership parameter $\theta$ may
resolve this issue. However, according to Erosheva’s representation theorem, the profile distributions
$F_k$ control where the data is and $\theta$ only controls how much data is in each location (Equations 3.10,
3.11, and Section 3.5). Figure 3.9 illustrates the result of making $\theta_i$ a function of $t$, $\theta_i(t)$.

FIGURE 3.7

Profile and individual regression functions in a mixed membership logistic regression model. The
thick dashed lines indicate the profile regression functions $m_k(t)$. The thin lines show individual
regression functions $m_i(t)$ for a range of values of $\theta_i$.

If the profile distributions $F$ are linear transformations of their parameters (Equation 3.33),
then a mixed membership regression model with profiles $F(m(x))$ will have individual regression
functions $m_i(x)$. Otherwise a mixed membership model will not produce continuous individual
regression functions.

Functional data are a class of data of the form $X_{ij} = f_i(t_{ij}) + \epsilon_{ij}$, where $f_i$ is an individual
smooth function, but we only observe a set of noisy measurements $X_{ij}$ and $t_{ij}$ for each individual
(Ramsay and Silverman, 2005; Serban and Wasserman, 2005). For example, suppose we observe
the height of children at different ages, or temperature at discrete intervals over a period of time. In
this type of data analysis, the functions $f_i$ and the similarities and variation between them are the
primary objects of inference.

The examples in this section demonstrate that without fundamentally altering the exchange- 
ability assumption of the general mixed membership model (Equation 3.4), a MMM cannot fit 
functional data. Equation (3.54) will never produce smooth individual regression functions. Gal- 
yardt (2012), Gruhl and Erosheva (2013), and Mohamed et al. (2013) suggest a way in which the 
exchangeability assumption might be altered to model individual regression functions as lying be- 
tween the profile functions. 

3.7.3 Children’s Numerical Magnitude Estimation 

The mixed membership regression model with normal errors is based on an analysis of the strategies 
and representations that children use to estimate numerical magnitude. This has been an active area 
of research in recent years (Ebersbach et al., 2008; Moeller et al., 2008; Siegler and Booth, 2004; 
Siegler and Opfer, 2003; Siegler et al., 2009). The primary task in experiments studying numerical 
magnitude estimation is a number line task. The experimenter presents each child with a series of 
number lines which have only the endpoints marked. The scale of the number lines is most often 
0 to 100, or 0 to 1000. The child estimates a number by marking the position where they think the 
number ‘belongs.’ Each child will estimate a series of numbers, with a single number line on each
page.

FIGURE 3.8

Mixed membership regression model with normal errors. The two plots show contours of the data
distribution for two different values of $\theta_i$. The thick dashed lines indicate the profile regression
functions. Lighter contour lines indicate higher density. Note that there is no individual regression
function $m_i$ that can summarize data from this distribution.

FIGURE 3.9

Mixed membership regression model with normal errors. Contour plot of an individual data distribution
where $\theta_{i1}(t)$ is an increasing function of $T$. The thick dashed lines indicate the profile
regression functions. Lighter contour lines indicate higher density. We cannot summarize data from
this distribution with any smooth regression function $m_i(t)$.

There are competing theories as to how children represent numerical magnitude and the strategies
that they use to estimate numbers (Ebersbach et al., 2008; Galyardt, 2012; Moeller et al., 2008;
Opfer and Siegler, 2007; Siegler et al., 2009). This argument is not our primary concern. We will
focus on the aspect of performance that all of the studies agree upon: there is an immature pattern
and a mature pattern. Older children are able to accurately and linearly estimate numerical magnitude.
That is, if $T_{ir}$ is the $r$th number you ask child $i$ to estimate, then their estimates $X_{ir}$ can be
modeled as $X_{ir} = T_{ir} + \epsilon_{ir}$.

Young children consistently overestimate small numbers. For example, a kindergartner estimating
on the 0-100 scale may place the number 23 three-quarters of the distance from 0 to 100,
near a position of 75. These children also appear to not differentiate well between larger quantities,
so that they might place both 56 and 84 near a position of 90. The estimate from a child displaying
the immature pattern will follow $X_{ir} = m(T_{ir}) + \epsilon_{ir}$. The exact functional form of $m(x)$ is disputed;
Opfer and Siegler (2007) and Siegler et al. (2009) suggest that it is logarithmic; Ebersbach
et al. (2008) and Moeller et al. (2008) suggest that it is piece-wise linear.

At this point, it seems natural to model children who are learning the mature representation as 
having mixed membership in both representations (Galyardt, 2010). We can represent each strategy 
with a MMM profile and use the membership parameter to indicate the degree to which a child 
has learned the mature strategy. Thus the profiles are mixed membership regression functions with 
normal errors, as in Equation (3.54). The distribution of individual data predicted by this model 
would be similar to the distributions shown in Figure 3.8. This mixed membership model would 
embody a ‘switching’ interpretation; sometimes the child uses the mature strategy and sometimes 
the child uses the immature strategy. 

This is where the difference between the switching and blending interpretations becomes criti- 
cal. Children using the immature strategy will estimate the number 30 near the position 80, while 
those using the mature strategy will estimate the position accurately at 30. If a child is blending 
the two strategies, then a model should predict an estimate at a position between 30 and 80. On 
the other hand, if a child is switching between the mature and the immature strategy, then a model 
should predict estimates near these two points and have lower probability in the middle. 
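
The contrast between the two interpretations can be sketched with a short simulation. This is an illustrative sketch only: the positions 30 and 80 come from the example above, while the noise level, the membership value, the sample size, and all function names are assumptions. Under switching, almost no estimates fall in the region between the two profiles; under blending, almost all of them do.

```python
import random

def simulate(theta_mature, n, mode, seed=0, mature=30.0, immature=80.0, sd=3.0):
    """Draw n estimates of one target number under a switching or blending model.

    theta_mature is a hypothetical membership in the mature profile.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if mode == "switching":
            # Each response is generated by exactly one profile.
            center = mature if rng.random() < theta_mature else immature
        else:
            # Blending: the response is centered on the weighted average.
            center = theta_mature * mature + (1.0 - theta_mature) * immature
        draws.append(rng.gauss(center, sd))
    return draws

def share_in_middle(draws, low=45.0, high=65.0):
    """Fraction of estimates falling in the region between the two profiles."""
    return sum(low <= x <= high for x in draws) / len(draws)

switching = simulate(0.5, 2000, "switching")
blending = simulate(0.5, 2000, "blending")
print(share_in_middle(switching), share_in_middle(blending))
```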

Figure 3.10 shows data from a number line estimation task for six representative individuals.
We can see immediately that this is functional data. Each child’s strategy can be represented by a
single smooth curve, $f_i$.

Some children clearly display the immature pattern, some children display the mature pattern. 
The interesting patterns belong to the children between the two extremes. Yet the mixed membership 
regression model cannot capture this variation, even with the addition of more profiles. The profiles 
are normal, and since mixtures of normals are not normal, the individual distributions will not be 
normal. Therefore the exchangeability assumptions in Equation (3.54) will not produce a smooth 
regression function for each individual. 

In this kind of application, we want to model where each individual lies between the two extremes.
A mixed membership model cannot capture the patterns of variation that are present in this
data. As one measure of model misfit, an attempt to use the mixed membership model with normal
errors (Equation 3.54) on this data resulted in estimates of $\sigma > 30$, with data on a scale of
0-100 (Galyardt, 2010). One way to solve this problem is to apply functional data analysis tools,
the approach successfully used in Galyardt (2012). Another approach is to alter the exchangeability
assumption to allow for a ‘between’ interpretation (Gruhl and Erosheva, 2013; Mohamed et al.,
2013).

FIGURE 3.10

Each box displays data from a single child participant in Siegler and Booth (2004). Individuals were
selected to display the range of strategies observed in the data. The immature and mature patterns
are present, but other intermediate patterns are present as well. (Horizontal axis: number to be estimated.)


3.8 Discussion 

Everything presented in this chapter is a straightforward observation based on Erosheva’s repre- 
sentation theorem (Erosheva et al., 2007). Every mixed membership model can be expressed as a 
finite mixture model with a much larger number of classes. Therefore, the best way to understand 
how mixed membership models behave and how we should interpret them is by focusing on the 
relationship with finite mixture models. 

Categorical data and the multinomial distribution have a unique behavior within the family of 
finite mixture models. Therefore categorical data have a unique behavior within the family of mixed 
membership models. 

In general, individuals with mixed membership in multiple profiles should be interpreted as 
switching between the profiles. For example, a student who uses one strategy on one problem and 
switches to another strategy for the next problem; or one segment of an image from the water profile 
that then switches its next segment to the tree profile. This switching interpretation is inherent in the 
exchangeability assumption that observed variables are independent conditional on the individual’s 
membership parameter. 

Only in a small set of special cases, including the multinomial distribution, can we interpret 
mixed membership as individuals being between the profiles. In these cases, the general switching 
interpretation is also accurate. Think of an individual who has mixed heritage. In the between in- 
terpretation, we can consider this individual as blending the two heritages together. Whereas in the 
switching interpretation, one gene may come from one heritage while the next gene comes from 
another heritage. In this special case, both interpretations work. 

Changing the distribution of the membership parameters has no effect on which interpretations
are available. Whether or not the profile distributions are linear transformations of their parameters
is the only thing that determines whether the between interpretation is available. The same property
is at work in the more complicated regression examples as in the simple examples.

Mixed membership models describe individuals as switching between profiles. Partial membership
models (Galyardt, 2012; Gruhl and Erosheva, 2013; Mohamed et al., 2013) describe individuals as
blending profiles. Only in very special cases do the two interpretations overlap.

In this chapter we show how mixture models, partial membership models, factor analysis, and their
extensions to more general mixed membership models, can be unified under a simple framework 
using the exponential family of distributions and variations in the prior assumptions on the latent 
variables that are used. We describe two models within this common latent variable framework: 
a Bayesian partial membership model and a Bayesian exponential family factor analysis model. 
Accurate inferences can be achieved within this framework that allow for prediction, missing value 
imputation, and data visualization, and importantly, allow us to make a broad range of insightful 
probabilistic queries of our data. We emphasize the adaptability and flexibility of these models for a 
wide range of tasks, characteristics that will continue to see such models used at the core of modern 
data analysis paradigms. 

4.1 Introduction

Latent variable models are ubiquitous in machine learning and statistics and are core components 
of many of the most widely-used probabilistic models, including mixture models (Newcomb, 1886; 
Bishop, 2006), factor analysis (Bartholomew and Knott, 1999), probabilistic principal components 
analysis (Tipping and Bishop, 1997; Bishop, 2006), mixed membership models (Erosheva et al., 
2004), and matrix factorization (Lee and Seung, 1999; Salakhutdinov and Mnih, 2008), amongst 
others. Latent variables are useful because they provide a mechanism for achieving many of the
desiderata of modern data modeling: robustness to noise, accurate prediction of future events, the
ability to handle and impute missing data, and insight into the phenomena underlying our data.
For example, in mixture models the latent variables
represent the membership of data points to one of a set of underlying classes; in topic models the 
latent variables allow us to represent the distribution of topics captured within a set of documents. 

The broad applicability of mixed membership models is expanded upon throughout this volume, 
and here we shall focus on simpler instances of the general mixed membership modeling framework 
to emphasize this wide applicability. In this chapter, we show how mixture models, factor analysis, 
and partial membership models and their generalization to mixed membership models can be unified 
under a common modeling framework. Moreover, we show how exponential family likelihoods can 
be used to provide a very general tool for modeling diverse data types, such as binary, count, or non- 
negative data, etc. Specifically, we will develop two models: a Bayesian partial membership model 
(BPM) (Heller et al., 2008) and a Bayesian exponential family factor analysis (EXFA) (Mohamed 
et al., 2008), and demonstrate the power of these models for accurate prediction and interpretation 
of data. 

As a case study, we will use an analysis of recorded votes: data that lists the names of those 
voting for or against a motion. In particular, we will focus on the roll call of the U.S. Senate and
demonstrate the different perspectives of the data that can be obtained, including the types of prob- 
abilistic queries that can be made with an accurate model of the data. Recorded votes are stored as a 
binary matrix and we describe a general approach for handling this type of data, and generally, any 
data that can be described by members of the exponential family of distributions. We develop two 
probabilistic models: the first is a model for partial memberships that allows us to describe sena- 
tors on a scale of fully-allegiant Democrats to fully-allegiant Republicans. This is a natural way of 
thinking about such data, since senators are often grouped into blocs depending on their degree of 
membership to these two groups, such as moderate Democrats, Republican majority, etc. Secondly, 
we develop factor models that provide a means of representing the underlying factors or traits that 
senators use in their decision making. These two models will be shown to arise naturally from 
the same probabilistic framework, allowing us to explore different assumptions on the underlying 
structure of the data. 

We begin our exposition by providing the required background on conjugate-exponential family 
models (Section 4.2.1). We then show that by considering a relaxation of standard mixture models 
we arrive naturally at two useful model classes: latent Dirichlet models and latent Gaussian models 
(which we expand upon in Section 4.4). In Section 4.2.3, we show that the assumption of Dirichlet 
distributed latent variables allows us to develop a model that quantifies the partial membership of 
objects to clusters, and that the assumption of continuous, unconstrained latent variables in Sec- 
tion 4.2.4 leads to an exponential family factor analysis. We focus on Markov chain Monte Carlo 
methods for learning in both models in Section 4.3. Whereas many types of mixed membership 
models focus on representing the data at two levels (e.g., a subject and a population level), here we 
operate at one level (subject level) only, and we describe the relationship between our approach and 
other mixed membership models such as latent Dirichlet allocation and mixed membership matrix 
factorization (Blei et al., 2003; Erosheva et al., 2004; Mackey et al., 2010) in Section 4.4. We provide
some experimental results and explore the roll call data in Section 4.5. 

Notation. Throughout this chapter we represent observed data as an $N \times D$ matrix
$X = [x_1, \ldots, x_N]^\top$, with an individual data point $x_n = [x_{n1}, \ldots, x_{nD}]$. $N$ is the number of data points
and $D$ is the number of input features. $\Theta$ is a $K \times D$ matrix of model parameters with rows $\theta_k$. $V$
is an $N \times K$ matrix $V = [v_1, \ldots, v_N]^\top$ of latent variables with rows $v_n = [v_{n1}, \ldots, v_{nK}]$, which
are $K$-dimensional vectors of continuous values in $\mathbb{R}$. $K$ is the number of latent factors representing
the dimensionality of the latent variable.


4.2 Membership Models for the Exponential Family 
4.2.1 The Exponential Family of Distributions 

The choice of likelihood function $p(x \mid \eta)$ for parameters $\eta$ and observed data $x$ is central to the
models we describe here. In particular, we would like to model data of different types, i.e., data 
that may be binary, categorical, real-valued, etc. To achieve this objective, we make use of the 
exponential family of distributions, which is an important family of distributions that emphasizes 
the shared properties of many standard distributions, including the binomial, Poisson, gamma, beta, 
multinomial, and Gaussian distributions (Bickel and Doksum, 2001). The exponential family of 
distributions allows us to provide a singular discussion of the inferential properties associated with 
members of the family and thus, to develop a modeling framework generalized to all members of 
the family. 

In the exponential family of distributions, the conditional probability of $x_n$ given parameter value
$\eta$ takes the following form:

$$p(x_n \mid \eta) = \exp\{ s(x_n)^\top \eta + h(x_n) - g(\eta) \}, \tag{4.1}$$

where $s(x_n)$ are the sufficient statistics, $\eta$ is a vector of natural parameters, $h(x_n)$ is a function of the
data, and $g(\eta)$ is the cumulant or log-partition function. For this chapter, the natural representation
of the exponential family likelihood is used such that $s(x) = x$. For convenience, we shall represent
a variable $x$ that is drawn from an exponential family distribution using the notation $x \sim \mathrm{Expon}(\eta)$,
with natural parameters $\eta$.

Probability distributions that belong to the exponential family also have corresponding conjugate
prior distributions $p(\eta)$, for which both $p(\eta)$ and $p(x \mid \eta)$ have the same functional form. The
conjugate prior distribution for the exponential family distribution of Equation (4.1) is:

$$p(\eta) \propto \exp\{ \lambda^\top \eta - \nu g(\eta) + f(\lambda) \}, \tag{4.2}$$

where $\lambda$ and $\nu$ are hyperparameters of the prior distribution. We use the shorthand $\eta \sim \mathrm{Conj}(\lambda, \nu)$
to denote draws from a conjugate distribution.

As an example, consider binary data, for which an appropriate data distribution is the Bernoulli
distribution and the corresponding conjugate prior is the Beta distribution. The Bernoulli distribution
has the form $p(x \mid \mu) = \mu^x (1 - \mu)^{1-x}$, with $\mu$ in $[0, 1]$. The exponential family form, using the
terms in Equation (4.1), is described using $h(x) = 0$, $\eta = \ln\left(\frac{\mu}{1-\mu}\right)$, and $g(\eta) = \ln(1 + e^{\eta})$. The
natural parameters can be mapped to the parameter values of the distribution using the link function,
which is the logistic sigmoid in the case of the Bernoulli distribution. The terms of the conjugate
distribution can also be derived easily.
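
The Bernoulli example can be checked numerically. The sketch below is illustrative only (the function names and the value $\mu = 0.3$ are assumptions): it evaluates the exponential family form of Equation (4.1) with $s(x) = x$ and $h(x) = 0$, compares it with the standard form $\mu^x (1-\mu)^{1-x}$, and confirms that the logistic sigmoid maps the natural parameter back to $\mu$.

```python
import math

def natural_parameter(mu):
    """eta = ln(mu / (1 - mu)), the log-odds."""
    return math.log(mu / (1.0 - mu))

def log_partition(eta):
    """g(eta) = ln(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def bernoulli_expfam(x, eta):
    """p(x | eta) = exp{x * eta + h(x) - g(eta)} with s(x) = x and h(x) = 0."""
    return math.exp(x * eta - log_partition(eta))

def sigmoid(eta):
    """Logistic sigmoid, mapping the natural parameter back to the mean mu."""
    return 1.0 / (1.0 + math.exp(-eta))

mu = 0.3
eta = natural_parameter(mu)
for x in (0, 1):
    standard = mu ** x * (1.0 - mu) ** (1 - x)
    print(x, standard, bernoulli_expfam(x, eta))
```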

4.2.2 Beyond Mixture Models

Mixture models are a common approach for assigning membership of observations to a set of distinct
clusters. For a finite mixture model with $K$ mixture components, the probability of a data
observation $x_n$ given parameters $\Theta$ is

$$p(x_n \mid \Theta) = \sum_{k=1}^{K} \rho_k\, p_k(x_n \mid \theta_k), \tag{4.3}$$

where $p_k(\cdot)$ is the probability distribution of mixture component $k$, and $\rho_k$ is the mixing proportion.
We can express this using indicator variables $v_n = [v_{n1}, v_{n2}, \ldots, v_{nK}]$ as

$$p(x_n \mid \Theta) = \sum_{v_n} p(v_n) \prod_{k=1}^{K} \big( p_k(x_n \mid \theta_k) \big)^{v_{nk}}, \tag{4.4}$$

where $v_{nk} \in \{0, 1\}$, $\sum_k v_{nk} = 1$, and $p(v_{nk} = 1) = \rho_k$. If $v_{nk} = 1$, then observation $n$ belongs to
cluster $k$, and therefore $v_{nk}$ indicates the membership of observations to clusters.

We now consider a relaxation of this model: relaxing the constraint that $v_{nk} \in \{0, 1\}$ to instead
be continuous-valued and removing the sum-to-one constraint. The probability in Equation (4.4)
must now be modified, and becomes

$$p(x_n \mid \Theta) = \int_{v_n} p(v_n)\, \frac{1}{Z(v_n, \Theta)} \prod_{k=1}^{K} \big( p_k(x_n \mid \theta_k) \big)^{v_{nk}}\, dv_n, \tag{4.5}$$

where we have integrated over the continuous latent variables rather than summing, and have introduced
the normalizing constant $Z$, which is a function of $v_n$ and $\Theta$, to ensure normalization.

By substituting the exponential family distribution (4.1) into Equation (4.5), the likelihood can
be expressed as

$$x_n \mid v_n, \Theta \sim \mathrm{Expon}\left( \sum_k v_{nk} \theta_k \right), \tag{4.6}$$

which is obtained by combining terms in log-space and requiring the resulting distribution to be normalized.
The computation of the normalizing constant $Z$ in Equation (4.5) is thus always tractable.
Thus, we see that the observed data can be described by an exponential family distribution with
natural parameters that are given by the linear combination of the coefficients $\theta_k$ weighted by the
latent variables $v_{nk}$.
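
The step from Equation (4.5) to Equation (4.6) can be verified numerically for the Bernoulli case. In the sketch below, the natural parameters `thetas` and the latent weights `v` are hypothetical values and the function names are assumptions. It normalizes the product of powered component densities over $x \in \{0, 1\}$ and compares the result with a single Bernoulli whose natural parameter is $\sum_k v_{nk} \theta_k$.

```python
import math

def bernoulli(x, eta):
    """Bernoulli probability of x in {0, 1} with natural parameter eta."""
    return math.exp(x * eta - math.log1p(math.exp(eta)))

def normalized_product(x, v, thetas):
    """(1/Z) * prod_k p_k(x | theta_k)^{v_k}, normalized over x in {0, 1}."""
    def unnormalized(value):
        return math.prod(bernoulli(value, th) ** w for w, th in zip(v, thetas))
    z = unnormalized(0) + unnormalized(1)  # Z depends on v and the thetas
    return unnormalized(x) / z

thetas = [-2.0, 1.5, 0.3]   # hypothetical natural parameters theta_k
v = [0.2, 0.5, 0.3]         # hypothetical latent weights v_nk
eta_combined = sum(w * th for w, th in zip(v, thetas))

for x in (0, 1):
    print(x, normalized_product(x, v, thetas), bernoulli(x, eta_combined))
```

The two columns agree because the log-partition terms $g(\theta_k)$ do not depend on $x$ and cancel in the normalization; the same check passes when the weights do not sum to one.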

We consider two types of constraints on the latent variables, which give rise to two important 
model classes. These are: 

Partial membership models. The latent variables can take any value in the range $v_{nk} \in [0, 1]$. It
is with this relaxation that we are able to represent data points that can belong partially to a cluster.
Such ideas are found in fuzzy set theory, mixed membership, and topic modeling.

Factor models. The latent variables are allowed to take any continuous value $v_{nk} \in \mathbb{R}$. Popular
models that stem from this assumption include factor analysis (FA) (Bartholomew and Knott,
1999), probabilistic principal components analysis (PCA) (Tipping and Bishop, 1997), and probabilistic
matrix factorization (PMF) (Salakhutdinov and Mnih, 2008), amongst others. The latent
variables form a continuous, low-dimensional representation of the input data. For easier interpretation,
one can restrict the latent variables to be nonnegative, allowing for a parts-based explanation
of the data (Lee and Seung, 1999).

Thus, we obtain a unifying framework for many popular latent variable models, whose key
difference lies in the nature of the latent variables used. Table 4.1 summarizes this insight and lists
some of the models that can arise from this framework.

TABLE 4.1

Models which can be derived from the unifying framework for latent variable models.

Model                                                                  Domain
mixture models                                                         $v_{nk} \in \{0, 1\}$
partial membership models (Heller et al., 2008)                        $v_{nk} \in [0, 1]$
exponential family PCA (Collins et al., 2002; Mohamed et al., 2008)    $v_{nk} \in \mathbb{R}$
nonnegative matrix factorization (Lee and Seung, 1999)                 $v_{nk} \in \mathbb{R}^{+}$

4.2.3 Bayesian Partial Membership Models 

We consider a model for partial membership that we refer to as the Bayesian partial membership
model (BPM) (Heller et al., 2008). The BPM is a model in which observations have
partial membership in each of $K$ classes. Consider political affiliations as an example: an individual’s
political leaning is not wholly socialist or wholly conservative, but may have partial membership in
both these political schools.

At the outset it is important to note the distinction between partial membership and uncertain 
membership. Responsibilities in mixture models are representations of the uncertainty in assigning 
full membership to a cluster, and this uncertainty can often be reduced with more data. Partial mem- 
bership represents a fractional membership in multiple clusters, such as a senator with moderate 
views in between that of being fully Republican or fully Democrat. 

FIGURE 4.1

Graphical models representing the relationship between latent variables, parameters, and observed
data for exponential family latent variable models: (a) Bayesian partial membership; (b) exponential
family factor analysis.

Figure 4.1(a) is a graphical representation of the generative process for the Bayesian partial
membership model. The plate notation represents replication of variables and the shaded node
represents observed variables. We denote the $K$-dimensional vector of positive hyperparameters
by $\alpha$. The generative model is: Draw mixture weights $\rho$ from a Dirichlet distribution with
hyperparameters $\alpha$, and a positive scaling factor $a$ from an exponential distribution with hyperparameter
$\beta > 0$; then draw a vector of partial memberships $v_n$ from a Dirichlet distribution,
representing the extent to which the observation belongs to each of the $K$ clusters.

$$\rho \sim \mathrm{Dir}(\alpha); \qquad a \sim \mathrm{Exp}(\beta), \tag{4.7}$$

$$v_n \sim \mathrm{Dir}(a\rho). \tag{4.8}$$

Each cluster $k$ is characterized by an exponential family distribution with natural parameters $\theta_k$ that
are drawn from a conjugate exponential family distribution, with hyperparameters $\lambda$ and $\nu$. Given
the latent variables and parameters, each data point is drawn from a data-appropriate exponential
family distribution:

$$\theta_k \sim \mathrm{Conj}(\lambda, \nu), \tag{4.9}$$

$$x_n \sim \mathrm{Expon}\left( \sum_k v_{nk} \theta_k \right). \tag{4.10}$$
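
The generative process in Equations (4.7)-(4.10) can be sketched for binary data as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the likelihood is taken to be Bernoulli, and a Normal(0, 2^2) draw stands in for the conjugate prior $\mathrm{Conj}(\lambda, \nu)$ to keep the example short. Dirichlet draws are obtained by normalizing gamma variates.

```python
import math
import random

def sample_dirichlet(rng, concentration):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(gammas)
    if total == 0.0:
        # Numerical underflow for tiny concentrations: fall back to a vertex,
        # chosen with probability proportional to the concentration.
        k = rng.choices(range(len(concentration)), weights=concentration)[0]
        return [1.0 if j == k else 0.0 for j in range(len(concentration))]
    return [g / total for g in gammas]

def sample_bpm(n_points, n_dims, alpha, beta, seed=0):
    """Generate binary data from a BPM-style model with Bernoulli likelihoods.

    Assumption: theta_kd ~ Normal(0, 2^2) replaces Conj(lambda, nu).
    """
    rng = random.Random(seed)
    n_clusters = len(alpha)
    rho = sample_dirichlet(rng, alpha)                    # rho ~ Dir(alpha)
    a = rng.expovariate(beta)                             # a ~ Exp(beta)
    theta = [[rng.gauss(0.0, 2.0) for _ in range(n_dims)] for _ in range(n_clusters)]
    memberships, data = [], []
    for _ in range(n_points):
        v = sample_dirichlet(rng, [a * r for r in rho])   # v_n ~ Dir(a * rho)
        # Natural parameters: sum_k v_nk * theta_k, one per dimension.
        eta = [sum(v[k] * theta[k][d] for k in range(n_clusters)) for d in range(n_dims)]
        x = [1 if rng.random() < 1.0 / (1.0 + math.exp(-e)) else 0 for e in eta]
        memberships.append(v)
        data.append(x)
    return rho, a, memberships, data

rho, a, V, X = sample_bpm(n_points=5, n_dims=8, alpha=[1.0, 1.0], beta=0.5)
for v, x in zip(V, X):
    print([round(w, 2) for w in v], x)
```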

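We denote $\Omega = \{V, \Theta, \rho, a\}$ as the set of unknown parameters with hyperparameters
$\Psi = \{\alpha, \beta, \lambda, \nu\}$. Given this generative specification, the joint probability is:

$$\begin{aligned}
p(X, \Omega \mid \Psi) &= p(X \mid V, \Theta)\, p(V \mid a, \rho)\, p(\Theta \mid \lambda, \nu)\, p(\rho \mid \alpha)\, p(a \mid \beta) \\
&= \prod_{n=1}^{N} p(x_n \mid v_n, \Theta)\, p(v_n \mid a, \rho) \prod_{k=1}^{K} p(\theta_k \mid \lambda, \nu)\; p(\rho \mid \alpha)\, p(a \mid \beta).
\end{aligned} \tag{4.11}$$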


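Substituting the forms for each distribution, the log joint probability is:

$$\begin{aligned}
\ln p(X, \Omega \mid \Psi) ={}& \sum_{n=1}^{N} \left[ \Big( \sum_k v_{nk} \theta_k \Big)^{\top} x_n + h(x_n) - g\Big( \sum_k v_{nk} \theta_k \Big) \right] \\
&+ \sum_{k=1}^{K} \left[ \lambda^\top \theta_k - \nu g(\theta_k) + f(\lambda) \right] \\
&+ N \ln \Gamma\Big( \sum_k a \rho_k \Big) - N \sum_k \ln \Gamma(a \rho_k) + \sum_n \sum_k (a \rho_k - 1) \ln v_{nk} \\
&+ \ln \Gamma\Big( \sum_k \alpha_k \Big) - \sum_k \ln \Gamma(\alpha_k) + \sum_k (\alpha_k - 1) \ln \rho_k + \ln \beta - \beta a.
\end{aligned} \tag{4.12}$$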


We arrive at the BPM model using a continuous latent variable relaxation of the mixture model.
As a result, the BPM reduces to mixture modeling with mixing proportions $\rho$ when $a \to 0$, which
follows from the limit of Equation (4.8). The BPM bears interesting relationships to several well-known
models, including latent Dirichlet allocation (LDA) (Blei et al., 2003), mixed membership
models (Erosheva et al., 2004), discrete components analysis (DCA) (Buntine and Jakulin, 2006), 
and exponential family PCA (Collins et al., 2002; Moustaki and Knott, 2000), which we discuss 
in Section 4.2.4. Unlike LDA and mixed membership models that capture partial memberships in 
the form of attribute-specific mixtures, the BPM does not assume a factorization over attributes and 
provides a general way of combining exponential family distributions with partial membership. 
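
The limit $a \to 0$ can be illustrated by sampling from $\mathrm{Dir}(a\rho)$ in Equation (4.8) for a small and a large scaling factor. The sketch below is illustrative only; the values of $a$, the uniform $\rho$, the sample size, and the function names are assumptions. For small $a$ the largest membership is close to one, so each draw behaves like a hard cluster indicator; for large $a$ the memberships concentrate near $\rho$.

```python
import random

def sample_dirichlet(rng, concentration):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(gammas)
    if total == 0.0:
        # Numerical underflow for tiny concentrations: fall back to a vertex.
        k = rng.choices(range(len(concentration)), weights=concentration)[0]
        return [1.0 if j == k else 0.0 for j in range(len(concentration))]
    return [g / total for g in gammas]

def mean_largest_membership(a, rho, n_draws=2000, seed=1):
    """Average of max_k v_nk over draws v_n ~ Dir(a * rho)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += max(sample_dirichlet(rng, [a * r for r in rho]))
    return total / n_draws

rho = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
near_mixture = mean_largest_membership(0.05, rho)   # small a: nearly hard assignments
diffuse = mean_largest_membership(50.0, rho)        # large a: memberships close to rho
print(near_mixture, diffuse)
```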


4.2.4 Exponential Family Factor Analysis 

We now consider a Bayesian model for exponential family factor analysis (EXFA) (Mohamed et al.,
2008). We can think of an exponential family factor analysis as a method of decomposing an observed
data matrix $X$, which can be of any type supported by the exponential family of distributions,
into two matrices $V$ and $\Theta$; we define the product matrix $P = V\Theta$. Since the likelihood depends
on $V$ and $\Theta$ only through their product $P$, this can also be seen as a model for matrix factorization.
In traditional factor analysis and probabilistic PCA, the elements of the matrix $P$, which are the
means of Gaussian distributions, lie in the same space as that of the data $X$. In the case of EXFA and
similar methods for non-Gaussian PCA such as EPCA (Collins et al., 2002; Moustaki and Knott,
2000), this matrix represents the natural parameters of the exponential family distribution of the
data.

The generative process for the EXFA model is described by the graphical model of Figure 4.1(b).
Let $m$ and $S$ be hyperparameters representing a $K$-dimensional vector of initial mean values and
an initial covariance matrix, respectively. Let $\alpha$ and $\beta$ be the hyperparameters corresponding to
the shape and scale parameters of an inverse-gamma distribution. We begin by drawing $\mu$ from a
Gaussian distribution and the elements $\sigma_i^2$ of the diagonal matrix $\Sigma$ from an inverse-gamma distribution.
For each data point $n$ of the factor score matrix $V$, we draw a $K$-dimensional Gaussian
latent variable $v_n$:

$$\mu \sim \mathcal{N}(\mu \mid m, S); \qquad \sigma_i^2 \sim i\mathcal{G}(\alpha, \beta), \tag{4.13}$$

$$v_n \sim \mathcal{N}(v_n \mid \mu, \Sigma). \tag{4.14}$$

The data is described by an exponential family distribution with natural parameters given by the
product of the latent variables $v_n$ and parameters $\theta_k$. The exponential family distribution modeling
the data and the corresponding prior over the model parameters are:

$$\theta_k \sim \mathrm{Conj}(\lambda, \nu), \tag{4.15}$$

$$x_n \mid v_n, \Theta \sim \mathrm{Expon}\left( \sum_k v_{nk} \theta_k \right). \tag{4.16}$$
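
A minimal sketch of the EXFA generative process for binary data follows. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the prior covariance is taken to be $S = s^2 I$, the likelihood is Bernoulli, and a standard normal draw stands in for $\mathrm{Conj}(\lambda, \nu)$. The inverse-gamma draw with shape `alpha` and scale `beta` is obtained as `beta` divided by a Gamma(`alpha`, 1) variate.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_exfa(n_points, n_dims, n_factors, m=0.0, s=1.0, alpha=3.0, beta=1.0, seed=0):
    """Generate binary data from an EXFA-style model with Bernoulli likelihoods.

    Assumptions: S = s^2 * I, and theta_kd ~ Normal(0, 1) replaces Conj(lambda, nu).
    """
    rng = random.Random(seed)
    mu = [rng.gauss(m, s) for _ in range(n_factors)]
    # Inverse-gamma(shape alpha, scale beta) draw: beta / Gamma(alpha, 1).
    sigma2 = [beta / rng.gammavariate(alpha, 1.0) for _ in range(n_factors)]
    theta = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_factors)]
    scores, data = [], []
    for _ in range(n_points):
        # v_n ~ N(mu, Sigma) with diagonal Sigma.
        v = [rng.gauss(mu[k], math.sqrt(sigma2[k])) for k in range(n_factors)]
        # Natural parameters: sum_k v_nk * theta_k, one per dimension.
        eta = [sum(v[k] * theta[k][d] for k in range(n_factors)) for d in range(n_dims)]
        data.append([1 if rng.random() < sigmoid(e) else 0 for e in eta])
        scores.append(v)
    return mu, sigma2, scores, data

mu, sigma2, V, X = sample_exfa(n_points=6, n_dims=4, n_factors=2)
for v, x in zip(V, X):
    print([round(w, 2) for w in v], x)
```

Unlike the partial membership sketch, the latent scores here are unconstrained real values, so the natural parameters are arbitrary linear combinations of the rows of $\Theta$.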

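We denote $\Omega = \{V, \Theta, \mu, \Sigma\}$ as the set of unknown parameters with hyperparameters
$\Psi = \{m, S, \alpha, \beta, \lambda, \nu\}$. Given the specification in Equations (4.13)-(4.16), the joint probability
distribution and its logarithm are:

$$p(X, \Omega \mid \Psi) = p(X \mid V, \Theta)\, p(\Theta \mid \lambda, \nu)\, p(V \mid \mu, \Sigma)\, p(\mu \mid m, S)\, p(\Sigma \mid \alpha, \beta),$$

$$\begin{aligned}
\ln p(X, \Omega \mid \Psi) ={}& \sum_{n=1}^{N} \left[ \Big( \sum_k v_{nk} \theta_k \Big)^{\top} x_n + h(x_n) - g\Big( \sum_k v_{nk} \theta_k \Big) \right] \\
&+ \sum_{k=1}^{K} \left[ \lambda^\top \theta_k - \nu g(\theta_k) + f(\lambda) \right] \\
&+ \sum_{n=1}^{N} \left[ -\frac{K}{2} \ln(2\pi) - \frac{1}{2} \ln |\Sigma| - \frac{1}{2} (v_n - \mu)^\top \Sigma^{-1} (v_n - \mu) \right] \\
&- \frac{K}{2} \ln(2\pi) - \frac{1}{2} \ln |S| - \frac{1}{2} (\mu - m)^\top S^{-1} (\mu - m) \\
&+ \sum_{i=1}^{K} \left[ \alpha \ln \beta - \ln \Gamma(\alpha) + (\alpha - 1) \ln \sigma_i^2 - \beta \sigma_i^2 \right],
\end{aligned} \tag{4.17}$$

where the functions $h(\cdot)$, $g(\cdot)$, and $f(\cdot)$ correspond to the functions of the chosen conjugate-exponential
family distribution for the data.

Whereas mixture models represent membership to a single cluster, and the BPM represents
partial membership to the set of clusters, EXFA explains the data using linear combinations of
all latent classes (an all-membership). EXFA thus provides a natural way of combining different
exponential family distributions and producing a shared latent embedding of the data using Gaussian
latent variables.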

4.3 Prior to Posterior Analysis

For both the Bayesian partial membership (BPM) model and exponential family factor analysis 
(EXFA), typical tasks include prediction, missing data imputation, dimensionality reduction, and 
data visualization. To achieve this, we must infer the posterior distribution $p(\Omega \mid X, \Psi)$, by which
we can visualize the structure of the data and compute predictive distributions. Due to the lack 
of conjugacy, analytic computation of the posterior is not possible. Although many approxima- 
tion methods exist for computing posterior distributions, we focus on Markov chain Monte Carlo 
(MCMC) because it provides a simple, powerful, and often surprisingly scalable family of methods. 
Using MCMC involves representing the posterior distribution by a set of samples, following which 
we use these samples for analysis, prediction, and decision making. 

4.3.1 Markov Chain Monte Carlo 

Markov chain Monte Carlo (MCMC) methods are a general class of sampling methods based on 
constructing a Markov chain with the desired posterior distribution as the equilibrium distribution 
of the Markov chain. MCMC methods are popular in machine learning and Bayesian statistics and 
include widely-known methods such as Gibbs sampling, Metropolis-Hastings, and slice sampling 
(Robert and Casella, 2004; Gilks et al., 1995). For sampling in the models of Sections 4.2.3 and 
4.2.4, we make use of a general purpose MCMC algorithm known as Hybrid (or Hamiltonian) 
Monte Carlo (HMC) sampling. 

Hybrid Monte Carlo (HMC), which was first described by Duane et al. (1987), is based on the 
simulation of Hamiltonian dynamics as a way of exploring the sample space of the posterior dis- 
tribution. Consider the task of generating samples from the distribution X), with being 

any relevant hyperparameters; we denote u as an auxiliary variable. Intuitively, HMC combines 
auxiliary variables with gradient information from the joint-probability to improve mixing of the 
Markov chain, with the gradient acting as a force that results in more effective exploration of the 
sample space. HMC can be used to sample from continuous distributions for which the density function
can be evaluated up to a normalizing constant, which need not be known. This makes HMC particularly amenable to sampling
in non-conjugate settings where the full conditional distributions required for Gibbs sampling can- 
not be derived, but for which the joint probability density and its derivatives can be computed. These 
properties make HMC well-suited to sampling from the BPM and EXFA models, since these models 
do not have a conjugate structure and all unknown variables $\Omega$ are continuous and differentiable,
making it possible to exploit available gradient information. 

For HMC, a potential energy function and a kinetic energy function are defined, whose sum 
forms the Hamiltonian energy: 


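$$\mathcal{H}(\Omega, u) = \mathcal{E}(\Omega \mid \Psi) + \mathcal{K}(u), \qquad \text{(Hamiltonian Energy)} \tag{4.18}$$

$$\mathcal{E}(\Omega \mid \Psi) = -\ln p(\Omega, X \mid \Psi), \qquad \text{(Potential Energy)} \tag{4.19}$$

$$\mathcal{K}(u) = \tfrac{1}{2} u^\top M^{-1} u. \qquad \text{(Kinetic Energy)} \tag{4.20}$$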


The Hamiltonian can be seen as the negative log (up to an additive constant) of an augmented
distribution to be sampled from:
$p(X, \Omega, u \mid \Psi) = p(X, \Omega \mid \Psi)\, \mathcal{N}(u \mid 0, M)$, where M is a preconditioning matrix often referred to
as a mass matrix, which in the simplest case is set to the identity matrix. The gradient of the potential
energy is defined as $\Delta(\Omega) = \partial \mathcal{E}(\Omega \mid \Psi) / \partial \Omega$. We defer further details of the physical underpinnings
describing Hamiltonian dynamics and its appropriateness for MCMC to the work of Neal (2010)
and Neal (1993).

We present the full algorithm for HMC in Algorithm 1.

    Evaluate gradient $g = \Delta(\Omega)$ with initial $\Omega$                    // g = gradE(theta)
    Evaluate energy $E = \mathcal{E}(\Omega \mid \Psi)$                            // E = findE(theta)
    for each sampling iteration do
        Initialize new momentum u drawn from a Gaussian
        Calculate $\mathcal{K}(u) = \tfrac{1}{2} u^\top u$ and $H = \mathcal{E}(\Omega \mid \Psi) + \mathcal{K}(u)$
        $\Omega_{new} \leftarrow \Omega$; $g_{new} \leftarrow g$
        for L leapfrog steps do
            $u \leftarrow u - \tfrac{\epsilon}{2} g_{new}$                          // make half-step in u
            $\Omega_{new} \leftarrow \Omega_{new} + \epsilon u$                     // make a step in theta
            $g_{new} \leftarrow \Delta(\Omega_{new})$                               // gradE(thetaNew)
            $u \leftarrow u - \tfrac{\epsilon}{2} g_{new}$                          // make half-step in u
        end
        $E_{new} = \mathcal{E}(\Omega_{new} \mid \Psi)$                            // Enew = findE(thetaNew)
        Calculate $\mathcal{K}(u) = \tfrac{1}{2} u^\top u$
        Hamiltonian $H_{new} \leftarrow E_{new} + \mathcal{K}(u)$
        if rand() < $\exp(-(H_{new} - H))$ then
            Accept $\leftarrow$ True
            $g \leftarrow g_{new}$; $\Omega \leftarrow \Omega_{new}$; $E \leftarrow E_{new}$
        else
            Accept $\leftarrow$ False
        end
    end

Algorithm 1: Hybrid Monte Carlo (HMC) Sampling (MacKay, 2003).

Each iteration of HMC has two steps. In the first step, we assume that an initial sample (state) for
$\Omega$ is given and we generate a Gaussian variable u for the momentum (line 4, Algorithm 1). In the
second step, we simulate Hamiltonian dynamics, which follow the equations of motion to move the
current sample and momentum to a new state. The Hamiltonian dynamics must be discretized for
implementation and the most popular discretization is known as the leapfrog method (lines 7-11).
The leapfrog approximation is simulated for L steps using a step-size $\epsilon$. The samples $\Omega^*$ and $u^*$
at the end of the leapfrog steps form the proposed state, which is accepted using the Metropolis
criterion (line 15):

$$\min\left(1, \exp\left(-\mathcal{H}(\Omega^*, u^*) + \mathcal{H}(\Omega, u)\right)\right). \tag{4.21}$$

Finally, marginal samples from $p(\Omega)$ are obtained by ignoring u.
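
As a minimal sketch of the steps in Algorithm 1, the following pure-Python code runs HMC with an identity mass matrix on a toy target, a two-dimensional standard Gaussian. The function names (`energy`, `grad_energy`, `hmc`), the toy target, and the settings of 10 leapfrog steps and a step-size of 0.15 are illustrative assumptions and are not taken from the chapter's Matlab code.

```python
import math
import random


def energy(theta):
    """Potential energy E = -ln p(theta) of a standard Gaussian, up to a constant."""
    return 0.5 * sum(t * t for t in theta)


def grad_energy(theta):
    """Gradient of the potential energy with respect to theta."""
    return list(theta)


def hmc(theta, n_samples, n_leapfrog, step_size, rng):
    """Hybrid Monte Carlo with an identity mass matrix (toy illustration)."""
    g = grad_energy(theta)
    e = energy(theta)
    samples, n_accept = [], 0
    for _ in range(n_samples):
        # Step 1: draw a fresh Gaussian momentum and record the Hamiltonian.
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        h = e + 0.5 * sum(ui * ui for ui in u)
        theta_new, g_new = list(theta), list(g)
        # Step 2: leapfrog discretization of the Hamiltonian dynamics.
        for _ in range(n_leapfrog):
            u = [ui - 0.5 * step_size * gi for ui, gi in zip(u, g_new)]
            theta_new = [ti + step_size * ui for ti, ui in zip(theta_new, u)]
            g_new = grad_energy(theta_new)
            u = [ui - 0.5 * step_size * gi for ui, gi in zip(u, g_new)]
        e_new = energy(theta_new)
        h_new = e_new + 0.5 * sum(ui * ui for ui in u)
        # Metropolis criterion: accept with probability min(1, exp(-(H_new - H))).
        if rng.random() < math.exp(min(0.0, h - h_new)):
            theta, g, e = theta_new, g_new, e_new
            n_accept += 1
        samples.append(list(theta))
    return samples, n_accept / n_samples


rng = random.Random(0)
samples, accept_rate = hmc([3.0, -3.0], n_samples=2000, n_leapfrog=10,
                           step_size=0.15, rng=rng)
kept = samples[1000:]  # discard the first half as burn-in
mean0 = sum(s[0] for s in kept) / len(kept)
var0 = sum((s[0] - mean0) ** 2 for s in kept) / len(kept)
```

The momentum is used only inside each iteration and is never stored, which corresponds to obtaining marginal samples by ignoring u. For a real model, `energy` and `grad_energy` would be replaced by the negative log joint probability and its gradient.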

Aspects of Implementation 

To implement HMC correctly we must adjust the energy function to account for variables that may
be constrained, such as variables that are nonnegative or bounded in $[0, 1]$. We make use of the
following transformations:

BPM with $a > 0$, $\sum_k \pi_k = 1$ and $\sum_k v_{nk} = 1$:

$$a = \exp(\eta); \qquad \pi_k = \frac{\exp(r_k)}{\sum_{k'} \exp(r_{k'})}; \qquad v_{nk} = \frac{\exp(\omega_{nk})}{\sum_{k'} \exp(\omega_{nk'})}.$$

EXFA with $\sigma_k^2 > 0$:

$$\sigma_k^2 = \exp(\xi_k).$$


The use of these transformations requires the inclusion of the determinant of the Jacobian of the 
change of variables, as well as consistent application of the chain rule for differentiation taking into 
account the change of variables. 
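
The change of variables can be sketched for a single positive variable as follows. The example assumes, purely for illustration, an Exponential(b) prior on a positive variable a with b = 2; the function names are hypothetical. It shows the log-Jacobian term that must be added to the log density after the substitution a = exp(eta), and the corresponding chain-rule gradient. The softmax map is included only to show how simplex constraints are satisfied; its Jacobian is not treated here.

```python
import math


def softmax(r):
    """Map unconstrained reals r to a point on the simplex (positive, sums to one)."""
    m = max(r)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in r]
    total = sum(exps)
    return [x / total for x in exps]


def log_prior_a(a, b=2.0):
    """Log density of the assumed Exponential(b) prior on a > 0."""
    return math.log(b) - b * a


def log_prior_eta(eta, b=2.0):
    """Log density of eta = ln(a): prior on a plus the log-Jacobian ln|da/d eta| = eta."""
    return log_prior_a(math.exp(eta), b) + eta


def grad_log_prior_eta(eta, b=2.0):
    """Chain rule: d/d eta [ln b - b * exp(eta) + eta] = -b * exp(eta) + 1."""
    return -b * math.exp(eta) + 1.0


pi = softmax([0.5, -1.0, 2.0])

# Riemann-sum check that the transformed density integrates to one over eta.
step = 0.001
mass = sum(math.exp(log_prior_eta(-20.0 + i * step)) * step for i in range(25001))

# Finite-difference check of the chain-rule gradient at eta = 0.3.
h = 1e-5
fd_grad = (log_prior_eta(0.3 + h) - log_prior_eta(0.3 - h)) / (2 * h)
```

Omitting the `+ eta` term would leave a function of eta that no longer integrates to one, which is why the determinant of the Jacobian has to enter the energy function.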


HMC has two tunable parameters: the number of leapfrog steps L and the step-size $\epsilon$. In general,
the step-size should be chosen so that the sampler’s rejection rate is between 25% and 35%, and a
large number of leapfrog steps should be used. Here we generally make use of L between 80 and 100.
The tuning of these parameters can be challenging in some cases, and we show ways in which these 
choices can be explored in the experimental section. We fix the mass matrix to the identity but this 
can also be tuned, and we discuss aspects of this in Section 4.6. Analysis of the optimal acceptance 
rates for HMC is discussed in Beskos et al. (2010); Neal (2010) provides a great deal of guidance 
in tuning HMC samplers. 

Many datasets contain missing values, and we can account for this missing data in a principled
manner by dividing the data into the sets of observed and missing entries, $X = \{X^{obs}, X^{missing}\}$, and
conditioning on the set $X^{obs}$ during inference. In practice, the pattern of missing data is represented
by a masking matrix, which is the indicator matrix of elements that are observed versus missing.
Probabilities are then computed using only the elements for which the masking matrix is set to 1.
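
A minimal sketch of the masking idea for binary data is given below. It assumes a Bernoulli likelihood and a matrix `P` of model probabilities; both `P` and the function name are hypothetical placeholders rather than quantities defined in the chapter.

```python
import math


def masked_bernoulli_loglik(X, P, mask):
    """Sum ln p(x_nd) over entries with mask[n][d] == 1; missing entries are skipped."""
    total = 0.0
    for x_row, p_row, m_row in zip(X, P, mask):
        for x, p, m in zip(x_row, p_row, m_row):
            if m == 1:
                total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


X = [[1, 0, None], [None, 1, 1]]          # None marks a missing entry
mask = [[0 if x is None else 1 for x in row] for row in X]
P = [[0.9, 0.2, 0.5], [0.5, 0.8, 0.7]]    # hypothetical model probabilities p(x_nd = 1)
loglik = masked_bernoulli_loglik(X, P, mask)
```

Because the missing entries contribute nothing to the sum, they also contribute nothing to the gradient used by the sampler.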


4.4 Related Work 

Mixed Membership Models and LDA. In general, mixed membership models (Erosheva et al., 
2004) organize the data in two levels using an admixture structure (mixture-of-mixtures model). 
Latent Dirichlet allocation (LDA) (Blei et al., 2003), as an instance of a mixed membership model, 
organizes the data at the level of words and then documents, expressing this data likelihood as a 
mixture of multinomials. LDA combines this mixture-likelihood with a K-dimensional Dirichlet-distributed
latent variable v as a distribution over topics. The BPM is a similar latent Dirichlet
model, but the latent variable represents partial memberships, and instead of a two-level structure, 
the BPM indexes the data directly using an exponential family likelihood. LDA assumes that each 
data attribute (i.e., words) of an observation (i.e., document) is drawn independently from a mixture 
distribution given the membership vector for the data point, $x_{nd} \sim \sum_k v_{nk}\, p(x \mid \theta_{kd})$. As a result,
LDA makes the most sense when the observations (documents) being modeled constitute bags of ex- 
changeable sub-objects (words). Furthermore, for both LDA and mixed membership models, there 
is a discrete latent variable for every sub-object, corresponding to which mixture component that 
sub-object was drawn from. This large number of discrete latent variables makes MCMC sampling 
potentially much more expensive than sampling in the exponential family models we describe here. 
A more detailed discussion and comparison of mixed and partial membership models is in the chap- 
ter by Gruhl and Erosheva (Gruhl and Erosheva, 2013, §2.4) in this volume and complements this 
discussion. 

Latent Gaussian Models. EXFA employs a K-dimensional Gaussian latent variable v and is
thus an example of a latent Gaussian model. This is one of the most established classes of mod- 
els and includes generalized linear regression models, nonparametric regression using Gaussian 
processes, state-space and dynamical systems, unsupervised latent variable models such as PCA,
factor analysis (Bartholomew and Knott, 1999), probabilistic matrix factorization (Salakhutdinov
and Mnih, 2008), and Gaussian Markov random fields. In generalized linear regression (Bickel and 
Doksum, 2001), the latent variables $v_n$ are the predictors formed by the product of covariates and
regression coefficients; in Gaussian process regression (Rasmussen and Williams, 2006), the latent
variables v are drawn jointly from a correlated Gaussian using a mean function and a covariance
function formed using the covariates; and in probabilistic PCA and factor analysis (Tipping and
Bishop, 1997; Bartholomew and Knott, 1999), latent variables $v_n$ are Gaussian with isotropic or
diagonal covariances, respectively. 

EXFA also follows as a Bayesian interpretation of exponential family PCA (Collins et al., 2002) 
and generalized latent trait models (Moustaki and Knott, 2000). Instead of fully Bayesian inference, 
these related models specify an objective function that is optimized to obtain the MAP solution. 
Similarly to the BPM, in EXFA the data is indexed directly using an exponential family distribution 
rather than through an admixture structure. With this realization though, it is easy to see the con- 
nection and extension of EXFA to a generalized mixed membership matrix factorization (MMMF) 
model by instead considering a two-level representation of the data similar to that described by 
Mackey et al. (2010). 

Both the BPM and EXFA model the natural parameters of an exponential family distribution. 
This makes them different from other latent variable models, such as nonnegative matrix factoriza- 
tion (NMF) (Lee and Seung, 1999; Buntine and Jakulin, 2006), since these alternative approaches 
model the mean parameters of distributions rather than their natural parameters. The use of natu- 
ral parameters allows for easier learning of model parameters, since these are often unconstrained, 
unlike learning for NMF which requires special care in handling constraints, e.g., leading to the 
multiplicative updates required for learning in NMF. 

Fuzzy Clustering. Partial membership is a cornerstone of fuzzy theory, and the notion that 
probabilistic models are unable to handle partial membership is used to argue that probability is a 
sub-theory, or different in character from fuzzy logic (Zadeh, 1965; Kosko, 1992). With the BPM, 
we are able to demonstrate that probabilistic models can be used to describe partial membership. 
Rather than using a mixture model for clustering, an alternative is given by fuzzy set theory and 
fuzzy k-means clustering (Bezdek, 1981). Fuzzy k-means clustering (Gasch and Eisen, 2002)
iteratively minimizes the objective function $J = \sum_n \sum_k v_{nk}^{\gamma} D^2(x_n, c_k)$, where $\gamma > 1$ is the fuzzy
exponent parameter, $v_{nk}$ represents the degree of membership of data point n to cluster k, with
$\sum_k v_{nk} = 1$, and $D^2(x_n, c_k)$ is a squared distance between the observation $x_n$ and the cluster
centre $c_k$. By varying $\gamma$, it is possible to attain different degrees of partial membership, with $\gamma = 1$
being k-means with no partial membership.
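
The objective and the effect of the fuzzy exponent can be sketched in a few lines. The code below evaluates J and applies the standard fuzzy k-means membership update for fixed cluster centres; the one-dimensional data, the centres, and the function names are illustrative assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between an observation and a cluster centre."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def fuzzy_objective(X, centres, V, gamma):
    """J = sum_n sum_k v_nk^gamma * D^2(x_n, c_k)."""
    return sum(V[n][k] ** gamma * sq_dist(x, c)
               for n, x in enumerate(X) for k, c in enumerate(centres))


def update_memberships(X, centres, gamma):
    """Standard membership update for gamma > 1; each row sums to one."""
    V = []
    for x in X:
        d2 = [max(sq_dist(x, c), 1e-12) for c in centres]  # guard against zero distance
        w = [d ** (-1.0 / (gamma - 1.0)) for d in d2]
        total = sum(w)
        V.append([wk / total for wk in w])
    return V


X = [[0.0], [1.0], [0.4]]
centres = [[0.0], [1.0]]
V_soft = update_memberships(X, centres, gamma=2.0)
V_sharp = update_memberships(X, centres, gamma=1.1)
J_soft = fuzzy_objective(X, centres, V_soft, gamma=2.0)
```

For the point at 0.4, the membership in the first cluster is about 0.69 with an exponent of 2 and moves close to 1 as the exponent approaches 1, which illustrates how strongly the degree of partial membership depends on this hand-set parameter.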

We compare fuzzy clustering and the BPM in Section 4.5 and find that the two approaches 
achieve very similar results, with the advantage of probabilistic models being that we obtain esti- 
mates of uncertainty, are able to deal with missing data, and can combine these models naturally 
with the wider set of probabilistic models. Thus, we hope that this work demonstrates that, contrary 
to the common misconception, fuzzy set theory is not needed to represent partial membership in 
probabilistic models, and that this can be achieved with established approaches for probabilistic 
modeling. 


4.5 Experimental Results 

We demonstrate the effectiveness of the models presented in this chapter using synthetic datasets as 
well as a real-world case study: roll call data from the U.S. Senate. We evaluate the performance 
of the methods by computing the negative log predictive probability (NLP) on test data. The test 
sets are created by setting 10% of the elements of the data matrix as missing data in the training set
and then learning in the presence of this missing data. We provide Matlab code to reproduce all the
results in this section online.¹


4.5.1 Synthetic Binary Data 
Noisy Bit Patterns 

We evaluate the behaviour of EXFA using a synthetic binary dataset. The synthetic data was gener- 
ated by creating three 16-bit prototype vectors, with each bit being set with probability 0.5. Each of 
the three prototypes is replicated 100 times, resulting in a dataset of 300 observations. Noise is then 
added to the data by flipping bits in the dataset with probability of 0.1 (Tipping, 1999; Mohamed 
et al., 2008). We use HMC to generate 5000 samples from the EXFA model with K = 3 factors, 
and demonstrate the evolution of the sampler in Figure 4.2. 




[Figure 4.2 panels: Log Joint Probability (Energy) versus Sample; Negative Log Predictive Probability (bits) versus Sample.]


FIGURE 4.2 

Reconstruction of data samples at various stages of the sampling in EXFA. Top two rows: Greyscale 
reconstructions at various samples and the true, noise-free data. Bottom row: Change in the energy 
function (using training data) and the corresponding predictive probability (using test data). We 
show circular markers at samples for which the reconstructions are shown above. 


¹ See www.shakirm.com/code/EFLVM/.

Since the sampler is initialized randomly, we see that the initial samples have no discernible structure.
As the sampling proceeds, the energy rapidly decreases (the energy is the negative log joint
probability, meaning lower is better), and useful structure can be seen after the 500th sample. By
the end of the sampling, the samples correctly capture the true data, as seen by comparing the mean
reconstruction computed using the last 1000 samples, and the true data in Figure 4.2. The negative
log predictive probability (NLP) of the test data computed for every sample also decreases as the
sampler progresses, indicating that the correct latent structure has been inferred, allowing for
accurate imputation of the missing data. A random predictor would have an NLP of 10% × 300 × 16 = 480
bits, and we can see that the NLP we obtain is much lower than this. The maximum likelihood estimation
of EXFA has NLP = 1148 bits, which is significantly worse than the Bayesian prediction. This
is a well-known problem, since maximum likelihood estimation in this model suffers from severe
overfitting, highlighting an important advantage of Bayesian methods over optimization methods
(Mohamed et al., 2008).
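
The 480-bit baseline can be reproduced with a short calculation. The sketch below assumes that exactly 10% of the 300 × 16 entries are held out and that the random predictor assigns probability 0.5 to each held-out bit, so every test entry costs one bit; the function name is hypothetical.

```python
import math


def nlp_bits(probs_of_true_values):
    """Negative log predictive probability in bits: -sum log2 p(x_test)."""
    return -sum(math.log2(p) for p in probs_of_true_values)


n_obs, n_bits, held_out_fraction = 300, 16, 0.10
n_test = round(held_out_fraction * n_obs * n_bits)   # number of held-out entries
baseline = nlp_bits([0.5] * n_test)                   # random predictor: p = 0.5 per bit
```

Any predictor that assigns more than probability 0.5 to the true value of each held-out bit, on average in log terms, scores below this baseline.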

Plots such as Figure 4.2 are also useful as tools for tuning an HMC sampler. For a fixed K, 
the region of high probability (low energy) is fixed, so this can be used to choose a step-size and the number of
leapfrog steps that allow us to rapidly reach this region. We fix L = 80 and tune $\epsilon$ by monitoring
the progression of the sampler. 

In practice, we can choose the number of latent factors K by cross-validation. To do this, we 
create 10 replications of our data and for each dataset we set 10% of the elements of the matrix 
as missing, using these elements as a held-out dataset. We then generate samples from the model 
over a range of K, and use the reconstruction error on the held-out data to choose the K that 
gives the best performance. We compare the negative log predictive probability (NLP) for K in the 
range of 2 to 20. We show the performance on the training and testing data in terms of root mean 
squared error (RMSE) as well as predictive probability (NLP) in Figure 4.3. We also compare the 
performance of the fully Bayesian approach using HMC that we presented, and the performance 
of maximum likelihood estimation in this model. The maximum likelihood estimators experience 
severe overfitting as shown by the RMSE on the training data. Since we would prefer a simpler 
model to a more complex one, we choose K = 3 based on the graphs of RMSE and NLP on the test
data. We discuss this issue of selecting K, and in particular, automatic methods for its selection in
Section 4.6. 
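
The construction of the held-out sets for this cross-validation can be sketched as follows. The code builds ten masks for a 300 × 16 data matrix, each hiding 10% of the entries; the function name and the use of the seeds 0 to 9 are assumptions made for the illustration. Fitting the model for each K and scoring the held-out entries is not shown.

```python
import random


def holdout_mask(n_rows, n_cols, fraction, rng):
    """Return a 0/1 mask in which `fraction` of the cells are set to 0 (held out)."""
    cells = [(n, d) for n in range(n_rows) for d in range(n_cols)]
    held = set(rng.sample(cells, round(fraction * len(cells))))
    return [[0 if (n, d) in held else 1 for d in range(n_cols)]
            for n in range(n_rows)]


# Ten replications of the 300 x 16 bit-pattern data, each with its own held-out set.
masks = [holdout_mask(300, 16, 0.10, random.Random(seed)) for seed in range(10)]
held_out_counts = [sum(row.count(0) for row in m) for m in masks]
```

Each mask plays the role of the masking matrix used during inference, with the zero entries serving as the test set for that replication.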

FIGURE 4.3

Choosing the number of latent factors K by cross-validation. We find that K = 3 is an appropriate
number of latent factors. 

Simulated Data from the BPM

We generated a synthetic binary dataset from the BPM consisting of N = 50 points, each being a
D = 32 dimensional vector using K = 3 clusters. We ran HMC for 4000 iterations, using the first
half as burn-in. To compare the true partial memberships $V_T$ to the inferred memberships $V_I$ we
computed $U_T = V_T V_T^\top$ and $U_I = V_I V_I^\top$, which measure the degree of shared membership
between pairs of observations for the true and inferred partial memberships, respectively
(Heller et al., 2008). This measure is invariant to permutations of the cluster labels, and its
entries lie in [0, 1]. We show image-maps of these matrices in Figure 4.4. The difference
between entries of the true and inferred shared memberships, $|U_T - U_I|$, is shown in the histogram.
The two matrices are highly similar, with 90% of entries being different from the true value by less
than 0.2, showing that the sampler was able to learn the true partial memberships.
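
The shared-membership measure and its invariance to relabeling can be checked on a small example. The membership matrix below is made up for the illustration, and the function name is hypothetical.

```python
def shared_membership(V):
    """U = V V^T: entry (n, m) is the inner product of membership rows n and m."""
    return [[sum(a * b for a, b in zip(vn, vm)) for vm in V] for vn in V]


# Three observations with memberships in K = 3 clusters (each row sums to one).
V = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5]]
U = shared_membership(V)

# Relabel the clusters by permuting the columns of V.
perm = [2, 0, 1]
V_perm = [[row[k] for k in perm] for row in V]
U_perm = shared_membership(V_perm)
```

Permuting the columns of V leaves U unchanged, so true and inferred memberships can be compared without first matching cluster labels.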

FIGURE 4.4

Image maps showing true shared partial memberships $U_T$ and inferred shared memberships $U_I$ for
synthetic data generated from the BPM model. The histogram shows the percentage of entries in
$|U_T - U_I|$ that fall within a given difference threshold.


4.5.2 Senate Roll Call Data 

Having evaluated the behavior of the BPM and EXFA on synthetic data, we demonstrate their use 
in exploring membership behavior from the U.S. Senate roll call as a case study. Specifically, we 
analyze the roll call from the 107th U.S. Congress (2001-2002) (Jakulin, 2002). The data consists 
of 99 senators (one senator died in 2002, and neither he nor his replacement is included), by 633
votes. It also includes the outcome of each vote, which we treat as an additional data point (like 
a senator who always voted the actual outcome). The matrix contains binary features for yea and 
nay votes, and abstentions are recorded as missing values. For the perspective of a political scientist 
analyzing such data, see the chapter by Gross and Manrique-Vallier (Gross and Manrique-Vallier, 
2013). 

We analyze the data using the BPM with K = 2 clusters, and show results of this analysis in 
Figure 4.5. Since there are two clusters and the amount of membership always sums to 1 across 
clusters, the figure looks the same regardless of whether we look at the ‘Democrat’ or ‘Republican’ 
cluster. The cyan line in Figure 4.5 indicates the partial membership assigned to each of the senators 
with their names overlaid. We can see that most Republicans and Democrats are clustered together in 
the flat regions of the line (with partial memberships very close to 0 or 1), but that there is a fraction 
of senators (around 20%) that lie somewhere in-between. Interesting properties of this figure include 
the location of Senator Jeffords (in magenta) who left the Republican party in 2001 to become an 
Independent who caucused with the Democrats. Also of note are Senator Chafee, who is known as a
moderate Republican and who often voted with the Democrats (for example, he was the only Republican
to vote against authorizing the use of force in Iraq), and Senator Miller, a conservative Democrat
who supported George Bush over John Kerry in the 2004 U.S. Presidential election. Lastly, it is
interesting to note the location of the outcome data point, which is very much in the middle. This
makes sense since the 107th Congress was split 50-50 (with Republican Vice President Dick Cheney
breaking ties), until Senator Jeffords became an Independent, at which point the Democrats had a
one seat majority.

FIGURE 4.5

Analysis of the partial memberships for the 107th U.S. Senate roll call using BPM. The line shows
the amount of membership in the ‘Democrat’ cluster with the names of Democrat senators overlaid
in blue and Republican senators in red.

We also analyzed the data using fuzzy k-means clustering, which found very similar rankings
of senators to the ‘Democrat’ cluster. However, the exact ranking and degree of partial membership
were highly sensitive to the fuzzy exponent parameter $\gamma$, which
is typically set by hand. Figure 4.6 shows the change in partial membership for the outcome and
the most-allegiant Democrat and Republican senators (using the result of Figure 4.5), for a range of
values for the fuzzy exponent. The graph shows that the assigned partial membership can vary quite
dramatically depending on the choice of $\gamma$. This type of sensitivity to parameters does not exist in
the Bayesian models we present here, since the parameters can be inferred automatically.

The BPM provides a very natural representation of the membership of individuals in this data 
to political leanings. An alternative viewpoint can be obtained using EXFA. With EXFA, the la- 
tent variables do not have an interpretation as a degree of membership, but rather provide a low- 
dimensional embedding of the data, which for the case of two latent factors, can be used to provide 
a spatial visualization of senators. We show the results of analyzing the roll call data with EXFA in 
Figure 4.8 using K = 2 latent factors, producing 4000 samples from the HMC sampler and using 
the first half as burn-in. The latent embedding in Figure 4.8 is color-coded blue for Democrats and 
red for Republicans, and shows that there is a natural separation of the data into these two groups. 
Similarly to the BPM, we observe that most senators are clustered into a Democrat or Republican 
cluster, with a percentage who straddle the boundary between these two groups. Again, we see the 
effect of the independent candidate and the outcome. It is also important to note the connection
of both the BPM and EXFA to ideal point models in political science (Bafumi et al., 2005), which
aim to spatially represent political preferences on a left-to-right scale. Using the BPM and EXFA,
we have with Figures 4.5 and 4.8 shown Bayesian approaches of producing 1D and 2D ideal point
representations, respectively.

FIGURE 4.6

Sensitivity of partial memberships in fuzzy k-means with respect to the fuzzy exponent.

FIGURE 4.7

Comparison of negative log predictive probabilities (in bits) across senators for BPM,
EXFA, and DPM.

As a further comparison of the BPM and EXFA, we also analyze the roll call data using a 
Dirichlet Process mixture model (DPM). We ran the DPM for 1000 Gibbs sampling iterations, 
sampling both assignments and concentration parameters. The DPM confidently finds four clusters: 
one cluster consists solely of Democrats, another solely of Republicans, a third cluster contains 9
moderate Republicans and Democrats as well as the outcome, and the last cluster consists of a 
single senator (Hollings (D-SC)). 

We calculate the negative log-predictive probability (NLP, in bits) across senators for the BPM, 
EXFA, and DPM (Figure 4.7). We present the mean, minimum, median, and maximum NLP over all 
senators, which represents the number of bits needed to encode a senator’s voting behavior. We also 
show the outcome separately. Except for the maximum, the BPM is able to produce a more com- 
pressed representation for each senator than the DPM, showing the sensibility of inferring partial 
memberships for this data, rather than assignments to clusters. EXFA produces the most compressed 
representation, since it uses unconstrained latent variables and thus has greater modeling flexibility.
These two approaches emphasize the tradeoff between modeling efficiency and interpretability that 
must be considered when analyzing such data. 

The BPM gives an intuitive numerical quantity to the degree of membership, whereas EXFA 
gives an intuitive spatial understanding of this membership. Factor models are also often used to 
model the covariance structure of data and provide further insight into the data. For Gaussian data, 
this covariance is given by $\Theta\Theta^\top$. For non-Gaussian data, we can compute the marginal covariance
$p(x_i = x_j)$, $i \neq j$, by Monte Carlo integration using the posterior samples obtained. We show this
in Figure 4.8(b) for the first 30 votes. The figure shows that there are many roll calls that are highly 
correlated, e.g., the first 14 entries represent the opening of the congress and are votes for chairs 
of various committees. Often not being votes of contention, there is highly correlated voting for 
these motions. Analysis of this matrix gives insight into the evolution of votes in the congress and 
provides an example of some of the probabilistic queries that can be made once the posterior samples 
are obtained. Other interesting probabilistic queries of this nature include examining the similarity 
of senators using the KL-distance between their latent posterior distributions, or examining the 
influence of senators to the voting outcomes using the marginal likelihood each senator contributes 
to the total probability. 
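
One way to estimate the probability that two binary votes agree from posterior samples is sketched below. It assumes that, for each posterior sample, the two votes are conditionally independent Bernoulli variables given the latent variables and parameters, so the agreement probability under a sample is the chance that both are 1 plus the chance that both are 0. The per-sample probabilities and the function name are hypothetical.

```python
def prob_agree(p_i, p_j):
    """Monte Carlo estimate of p(x_i = x_j) from per-sample Bernoulli probabilities."""
    if len(p_i) != len(p_j) or not p_i:
        raise ValueError("need two equally long, non-empty lists of samples")
    terms = [a * b + (1.0 - a) * (1.0 - b) for a, b in zip(p_i, p_j)]
    return sum(terms) / len(terms)


# Hypothetical per-sample probabilities of a 'yea' on two votes for one senator.
p_vote_i = [0.9, 0.8, 0.1]
p_vote_j = [0.95, 0.7, 0.2]
agreement = prob_agree(p_vote_i, p_vote_j)
```

Averaging over samples rather than plugging in a single point estimate is what makes this a Monte Carlo integration over the posterior.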

FIGURE 4.8

Analysis of the partial memberships for the 107th U.S. Senate roll call using EXFA. (a) The left plot
shows the latent embedding produced using two latent factors. (b) The right plot shows the marginal
covariance between votes.


4.6 Discussion 

Having gained an understanding of exponential family latent variable models and their behavior, we 
now consider some of the questions that affect our ability to use such models in practice. Questions 
that arise include: how to decide between competing models, methods for choosing the latent di- 
mensionality K, difficulty in tuning the MCMC samplers, and obstacles in applying these models 
to large datasets. We expand on these questions and discuss the ways in which our models can be 
extended to address them. 

Choice of Model. In this chapter we have considered mixture models, the Bayesian partial 
membership model, exponential family factor analysis, and mixed membership models. The choice 
of one model type over another depends on whether the modeling assumptions made match
our beliefs regarding the process that generated the data, as well as the aim of our modeling effort,
whether for visualization, predictive, or explanatory purposes. The BPM and EXFA are models
with a single layer of latent variables that we showed are relaxations of K-component mixture
models. We demonstrated in
the experiments that the models allowed for de-noising of data, effective imputation of missing data,
and are useful tools for visualization of high-dimensional data. The structure of the models proved 
to be intuitive and flexible, and appropriate for the tasks we presented. 

More flexible versions of these models can be obtained by considering the mixed membership 
analogues of the BPM and EXFA, such as Grade of Membership models (Erosheva et al., 2007;
Gross and Manrique-Vallier, 2013) and mixed membership matrix factorization (Mackey et al.,
2010), respectively. In addition, other prior assumptions may be needed; sparsity is one such prior
assumption that has gained importance and the inclusion of sparsity in the models discussed here is
described by Mohamed et al. (2012). Galyardt (2013) in this volume shows that mixed membership
models have an equivalent representation as a mixture model, with a number of mixture components
polynomial in K, thus providing a highly efficient representation of high-dimensional data.
Inference in these more complex models is harder due to the increased number of latent and assignment
variables, so our choice of model rests on the tradeoff between
simplicity, flexibility, and the computational complexity of the available models. A formal model
comparison would rely on Bayesian model selection, in which the ‘best’ model is chosen based on
the evaluation of the marginal likelihood or model evidence (Carlin and Chib, 1995). 

Choosing the Latent Dimensionality. In Section 4.5 we used cross-validation to determine 
the appropriate dimensionality of the latent variables. Ideally, we would wish to learn K auto- 
matically using the training data only. An alternative approach to cross-validation is by Bayesian 
model selection where we evaluate and compare the marginal likelihood or evidence for various 
models, e.g., as described by Minka (2001) for probabilistic PCA. The models we have described 
can also be adapted to include the determination of K as part of the learning algorithm. Bishop 
(1999) exploited sparsity by employing automatic relevance determination (ARD), which uses a 
large number of latent factors and sets to zero any factors that are not supported by the data; K is 
then the number of non-zero columns at convergence of the algorithm. It is also possible to specify 
the dimensionality of the latent variables as part of our model construction. This approach requires 
an efficient means of sampling in spaces with changing dimensionality, most often achieved by 
trans-dimensional MCMC, such as the approach described by Lopes and West (2004). More recent
approaches have focused on the construction of nonparametric latent factor models using the In- 
dian buffet process or other nonparametric priors to automatically adapt the dimensionality of latent 
variables (Knowles and Ghahramani, 2010; Bhattacharya and Dunson, 2011). 

Tuning MCMC Samplers. We made use of the standard approach for Hybrid Monte Carlo 
(HMC) sampling here, but this can be improved to increase the number of uncorrelated samples 
obtained. We used an identity mass matrix, but the mass matrix can instead be estimated adaptively
from the samples during the burn-in phase, using the empirical covariance or the Hessian of the log
joint probability, which reduces sensitivity to the choice of step-size $\epsilon$ (Atchade et al., 2011).

Using an appropriate mass matrix allows proposals to be made at an appropriate scale, thus 
allowing for larger step-sizes during sampling. But estimation of the mass matrix (and computing 
its inverse) can add significantly to the computation involved in HMC. Adaptive tuning of the mass 
matrix was also shown using the Riemann geometry of the joint-probability by Girolami and Calderhead
(2011). Another way of improving HMC was proposed in Shahbaba et al. (2011), and involves
splitting the Hamiltonian in a way that allows much of the movement around the state-space to be
done at low computational cost. Tuning the HMC parameters can be challenging, especially for the
non-expert, and methods now exist for the automatic tuning of HMC’s parameters (Hoffman and
Gelman, 2011; Wang et al., 2013). Any of these approaches removes the need for tuning HMC and
has the promise of making the application of HMC much more general purpose.

Deterministic Approximations for Large-scale Learning. With the increasing size of datasets, 
the availability of scalable inference is an important factor in the practical use of many models.
MCMC methods can be shown to scale well to large datasets (Salakhutdinov and Mnih, 2008).
Deterministic approximations are increasingly used in the development of scalable algorithms and can
allow better exploitation of the distributed nature of modern computing environments. Variational
inference for LDA was described by Teh et al. (2007), and such an approach can be applied to the
BPM. For latent Gaussian models, approximate inference methods such as integrated nested Laplace
approximations (INLA) (Rue et al., 2009) have been proposed. INLA is effective for models whose
latent variables are controlled by a small number of hyperparameters, limiting the application of this
approach for learning in EXFA. Variational methods for EXFA have also been successfully explored
(Khan et al., 2010).


4.7 Conclusion 

In this chapter, we have described a principled Bayesian framework for latent variable model- 
ing that is generalized to the exponential family of distributions. We began with the widely-used 
mixture model and showed that a relaxation of the assumption that each data point belongs to one
and only one cluster allows us to explore different aspects of the structure underlying the data. We 
obtained the Bayesian partial membership (BPM) model by allowing the latent variables to rep- 
resent fractional membership in multiple clusters, and obtained exponential family factor analysis 
(EXFA) by considering continuous latent variables (which explain contributions to the data using a 
linear combination from all clusters). By framing these models in the same latent variable frame- 
work, we exploited the continuous nature of the unknown parameters and demonstrated how Hybrid 
Monte Carlo can be implemented and tuned for such models. We also described the connection to 
other latent variable and mixed membership models. Using both synthetic and real-world data, we 
demonstrated the use of these models for visualization and predictive tasks and the wide range of 
insightful probabilistic queries that can be made using these models.

One issue with parametric latent class models, regardless of whether or not they feature mixed
memberships, is the need to specify a bounded number of classes a priori. By contrast, nonparametric
models use an unbounded number of classes, of which some random number are observed in the 
data. In this way, nonparametric models provide a method to infer the correct number of classes 
based on the number of observations and their similarity. 

The following chapter seeks to provide mathematical and intuitive understanding of nonpara- 
metric mixed membership models, focusing on the hierarchical Dirichlet process mixture model 
(HDPM). This model can be understood as a nonparametric extension of the Grade of Member- 
ship model (GoM) described by Erosheva et al. (2007). To elucidate this relationship, the Dirichlet 
mixture model (DM) and Dirichlet process mixture model (DPM) are first reviewed; many of the 
interesting properties of these latent class models carry over to the GoM and HDPM models. 

After describing these four models, the HDPM model is further explored through simulation 
studies, including an analysis of how the model parameters affect the model’s clustering behavior.
An overview of inference procedures is also provided with a focus on Gibbs sampling and varia- 
tional inference. Finally, some example applications and model extensions are briefly reviewed. 


5.1 Introduction 

Choosing the appropriate model complexity is a problem that must be solved in almost any statistical 
analysis, including latent class models. Simple models efficiently describe a small set of behaviors, 
but are not flexible. Complex models describe a wide variety of behaviors, but are subject to over- 
fitting the training data. For latent class models, complexity refers to the number of groups used to 
describe the distribution of observed and/or predicted data. One strategy is to fit multiple models of
varying complexity and then decide among them with a post-hoc analysis (e.g., penalized likelihood).
Nonparametric mixture models provide an alternate strategy which bypasses the need to choose the 
correct number of classes. The hallmark of nonparametric models is that their complexity increases 
stochastically as more data are observed. The rate of accumulation is determined by various tuning 
parameters and the similarity of observed data. 

One of the best-known examples of nonparametric Bayesian inference is the Dirichlet process 
mixture model (DPM), a nonparametric version of the Dirichlet mixture model (DM). The DM 
model assumes that the population consists of a fixed and finite number of classes and it therefore 
bounds the number of classes used to represent any sample. By contrast, a DPM posits that the 
population consists of an infinite number of classes. Of these, some finite but unbounded number of 
classes are observed in the data. Because the number of classes is unbounded, the model always has 
a positive probability of assigning a new observation to a new class. 

Both Dirichlet mixtures and Dirichlet process mixtures assume that observations are fully ex- 
changeable. Extensions for both models exist for situations in which full exchangeability is inappro- 
priate. This may be the case when multiple measurements are made for individuals in the sample. 
For example, one may consider a survey analysis in which each individual responds to several items. 
In this case, one expects two responses to be more similar if they come from the same individual. 
The Grade of Membership model (GoM) adapts the DM model for partial exchangeability (Erosheva
et al., 2007). In the GoM model, two responses are exchangeable if and only if they are measured
from the same individual. Like the DM model, it bounds the number of classes. The GoM model is 
known as a mixed membership model or individual-level mixture model because each individual in 
the sample is associated with unique mixing weights for the various classes. 

The hierarchical Dirichlet process mixture model (HDPM) extends the GoM model in the same 
way that the Dirichlet process mixture model extends the DM model. As with the GoM model, the 
HDPM model assumes that responses are exchangeable if and only if they come from the same 
individual. Whereas the GoM assumes that the population consists of a fixed and finite number of 
classes, the HDPM posits an infinite number of classes. Thus, the HDPM model does not bound the 
number of classes used to represent the sample. 

All four models (DM, DPM, GoM, and HDPM) cluster observations into various classes where 
class memberships are unobserved. They are distinguished by the type of exchangeability (full or 
partial) and whether or not the number of classes in the population is bounded a priori. 

Nonparametric mixture models, such as the DPM and HDPM models, have several intuitive ad- 
vantages. Because the number of classes is not fixed, they provide a posterior distribution over the 
model complexity. Posterior inference includes a natural weighting of high-probability models of 
varying complexity. Hence, uncertainty about the “true” number of classes is measurable. Further- 
more, because the number of classes is unbounded, nonparametric models always include a positive 
probability that the next observation belongs to a previously unobserved class. This property is
especially nice when considering predictive distributions. If the number of classes is unknown, it is
possible that the next observation will be unlike any of the previous observations. 

This chapter aims to provide an intuitive understanding of the hierarchical Dirichlet process mixture
model. Sections 5.2 and 5.3 begin with the fully exchangeable models, showing how the DM model
is built into the nonparametric DPM model by removing the bound on the number of classes.
Properties of these mixtures are illustrated and compared using the Chinese restaurant process. This 
relationship forms the foundation for exploring properties of the GoM and HDPM models in Sec- 
tions 5.4 and 5.5. In Section 5.6, the role of tuning parameters for the HDPM model is explored 
intuitively and illustrated through simulations. Section 5.7 provides an overview of inference strate- 
gies for DPM and HDPM models. Section 5.8 reviews some example applications with Section 5.9 
devoted to brief descriptions of some model extensions. 


5.2 The Dirichlet Mixture Model 

Suppose a sample contains n observations, $(x_1, \ldots, x_n)$, where $x_i$ is possibly vector-valued. A
latent class model assumes that each observation belongs to one of K possible classes, where K
is a finite constant. Observations are conditionally independent given their class memberships, but
dependence arises because class memberships are not observed. As a generative model, each $x_i$ is
drawn by randomly choosing a class, say $z_i$, then sampling from the class-specific distribution.

Denote the population proportions of these K classes by $\pi = (\pi_1, \ldots, \pi_K)$ and the distribution
of class k by $F_k$. For simplicity, assume that these distributions belong to some parametric family,
$\{F(\cdot \mid \theta) : \theta \in \Theta\}$. Therefore, $F_k = F(\cdot \mid \theta_k)$, where $\theta_k \in \Theta$ denotes the class-specific
parameter for class k. Given the class proportions and parameters, the latent class model is described
by a simple hierarchy:

$$z_i \mid \pi \sim \mathrm{Mult}(\pi), \quad i = 1, \ldots, n,$$
$$x_i \mid z_i, \theta \sim F(\cdot \mid \theta_{z_i}), \quad i = 1, \ldots, n.$$

Here, $\mathrm{Mult}(\pi)$ is the multinomial distribution satisfying $P(z_i = k) = \pi_k$ for $k = 1, \ldots, K$.

Inferential questions include learning the mixing proportions $(\pi_k)$, class parameters $(\theta_k)$, and
possibly the latent class assignments $(z_i)$. Uncertainty about the class proportions and parameters
may be expressed through prior laws. In the Dirichlet mixture model (DM), the class proportions
have a symmetric Dirichlet prior, $\pi \sim \mathrm{Dir}(\alpha/K)$. This distribution is specified by the precision
$\alpha > 0$ and has the density function

$$f(\pi \mid \alpha) = \frac{\Gamma(\alpha)}{[\Gamma(\alpha/K)]^{K}} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1} \tag{5.1}$$

wherever $\pi$ is a K-dimensional vector whose elements are non-negative and sum to 1. The range of
possible values for $\pi$ is known as the (K − 1)-dimensional simplex or more simply the (K − 1)-simplex.
The expected value of the symmetric Dirichlet distribution is the uniform probability vector
$E[\pi] = (1/K, \ldots, 1/K)$. The precision specifies the concentration of the distribution about this mean,
with larger values of $\alpha$ translating to less variability.
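
As a quick numerical check of how the precision controls concentration, the following Python sketch draws from a symmetric Dirichlet by normalizing independent gamma variates. This is a standard construction rather than one described in this chapter, and the function names are illustrative. The sample variance of one component is compared with the closed form $(1/K)(1 - 1/K)/(\alpha + 1)$, which shrinks as $\alpha$ grows.

```python
import random

def sample_symmetric_dirichlet(alpha, K, rng):
    """Draw pi ~ Dir(alpha/K) by normalizing K independent Gamma(alpha/K, 1) variates."""
    g = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(g)
    return [v / total for v in g]

def component_variance(alpha, K):
    """Closed-form variance of a single component of pi under Dir(alpha/K)."""
    return (1.0 / K) * (1.0 - 1.0 / K) / (alpha + 1.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
K = 4
for alpha in (0.5, 5.0, 50.0):
    draws = [sample_symmetric_dirichlet(alpha, K, rng)[0] for _ in range(5000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"alpha={alpha}: mean={mean:.3f}, var={var:.4f}, "
          f"exact var={component_variance(alpha, K):.4f}")
```

In each case the sample mean should sit near 1/K = 0.25, while the variance drops as the precision increases.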

More generally, an asymmetric Dirichlet distribution is defined by a precision $\alpha > 0$ and a mean
vector $\pi_0 = E[\pi] = (\pi_{01}, \ldots, \pi_{0K})$ in the (K − 1)-simplex. If $\pi \sim \mathrm{Dir}(\alpha, \pi_0)$, then its
density function is

$$f(\pi \mid \alpha, \pi_0) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha \pi_{0k}\right)}{\prod_{k=1}^{K} \Gamma(\alpha \pi_{0k})} \prod_{k=1}^{K} \pi_k^{\alpha \pi_{0k} - 1}. \tag{5.2}$$

Note that the symmetric Dirichlet $\mathrm{Dir}(\alpha/K)$ is equivalent to the distribution $\mathrm{Dir}(\alpha, \frac{1}{K}\mathbf{1})$, where $\mathbf{1}$
denotes the vector of ones.

The symmetric Dirichlet prior influences the way in which the DM model groups observations
into the K possible classes. To finish specifying the model, each class is associated with its class
parameter, $\theta_k$. The class parameters are assumed to be i.i.d. from some prior distribution $H(\lambda)$,
where $\lambda$ is a model-specific hyperparameter. This results in the following model:


Dirichlet Mixture Model

$$\pi \mid \alpha \sim \mathrm{Dir}(\alpha/K),$$
$$\theta_k \mid \lambda \sim H(\lambda), \quad k = 1, \ldots, K,$$
$$z_i \mid \pi \sim \mathrm{Mult}(\pi), \quad i = 1, \ldots, n,$$
$$x_i \mid z_i, \theta \sim F(\cdot \mid \theta_{z_i}), \quad i = 1, \ldots, n.$$


Because class memberships are dependent only on $\alpha$, the Dirichlet precision fully specifies the prior
clustering behavior of the model. The hyperparameter $\lambda$ only influences class probabilities during
posterior inference. A priori, observations are expected to be uniformly dispersed across the K
classes and $\alpha$ measures how strongly the prior insists on uniformity.
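
The four-line hierarchy above can be read directly as a sampling recipe. The sketch below is only an illustration under stated assumptions: the chapter leaves $H$ and $F$ generic, so the code arbitrarily takes $H(\lambda)$ to be Normal(0, $\lambda^2$) and $F(\cdot \mid \theta)$ to be Normal($\theta$, 1). The function name `sample_dm` is hypothetical, and classes are indexed from 0 rather than 1.

```python
import random

def sample_dm(n, K, alpha, lam, rng):
    """Generate (pi, theta, z, x) from a Dirichlet mixture.

    Illustrative assumptions (not from the text):
    H(lam) = Normal(0, lam^2) and F(. | theta) = Normal(theta, 1).
    """
    # pi | alpha ~ Dir(alpha/K), via normalized gamma variates
    g = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(g)
    pi = [v / total for v in g]
    # theta_k | lam ~ H(lam), k = 0..K-1
    theta = [rng.gauss(0.0, lam) for _ in range(K)]
    # z_i | pi ~ Mult(pi)
    z = rng.choices(range(K), weights=pi, k=n)
    # x_i | z_i, theta ~ F(. | theta_{z_i})
    x = [rng.gauss(theta[zi], 1.0) for zi in z]
    return pi, theta, z, x

pi, theta, z, x = sample_dm(n=10, K=3, alpha=3.0, lam=5.0, rng=random.Random(0))
print("class proportions:", [round(p, 3) for p in pi])
print("class memberships:", z)
```

Note that `lam` enters only through the class parameters, which matches the remark that $\lambda$ plays no role in the prior clustering of the memberships.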


5.2.1 Finite Chinese Restaurant Process 1 

Imagine a restaurant with an infinite number of tables, each of which has infinite capacity. Observations
are represented by customers and class membership is defined by the customer’s choice of
dish. All customers at a particular table eat the same dish. When a customer sits at an unoccupied table,
he selects one of the K possible dishes for his table with uniform probabilities of 1/K. Because
multiple tables may serve the same dish, the class membership of an observation must be defined
by the customer’s dish rather than his table.

The first customer sits at the first table and randomly chooses one of the K dishes. The second
customer joins the first table with probability $1/(1 + \alpha)$ or starts a new table with probability
$\alpha/(1 + \alpha)$. As subsequent customers enter, the probability that they join an occupied table is proportional
to the number of people already seated there. Alternatively, they may choose a new table with
probability proportional to $\alpha$.

Mathematically, let T denote the number of occupied tables when the nth customer arrives and
let $t_n$ denote the table that he chooses. Given the seating arrangement of the previous customers,
the probability function for $t_n$ is

$$f(t_n \mid \alpha, t_{-n}) \propto \begin{cases} \sum_{i<n} 1(t_i = t_n), & t_n \le T, \\ \alpha, & t_n = T + 1, \end{cases} \tag{5.3}$$

where $t_{-n}$ denotes the table assignments of all but the nth customer and $1(t_i = t_n)$ is the indicator
function, which is equal to 1 if $t_i = t_n$ and 0 otherwise.

Since all customers at a table eat the same dish, if a customer joins an occupied table, he eats
whatever dish was previously chosen for that table. When a customer starts a new table, he must select
a dish for that table by randomly choosing one of the K menu items with uniform probabilities.
Therefore, the distribution for the nth customer’s dish is

$$f(z_n = k \mid \alpha, z_{-n}) = \frac{\sum_{i<n} 1(z_i = k) + \alpha/K}{n - 1 + \alpha}, \quad k \le K, \tag{5.4}$$

where $z_i$ is the dish (class membership) for the ith customer and $z_{-n}$ is the vector of dishes for all
but the nth customer.

1 The typical Chinese restaurant process, as described in Section 5.3.2, illustrates the clustering behavior of the Dirichlet
process mixture model after integrating out the unknown vector of class proportions ($\pi$). Here, the modified finite version
describes a Dirichlet mixture by fixing the number of possible dishes at $K < \infty$.

The Chinese restaurant analogy depicts the clustering behavior of the Dirichlet mixture model.
Notably, tables with many customers are more likely to be chosen by subsequent customers. This
creates a “rich-get-richer” effect. The marginal distribution of $z_n$ is uniform due to the symmetric
Dirichlet prior, but the conditional distribution given $(z_1, \ldots, z_{n-1})$ is skewed toward the class
memberships of previous customers. Let $f_{\mathrm{edf}}(k) = \sum_{i<n} 1(z_i = k)/(n - 1)$ denote the empirical distribution
of the first n − 1 class memberships. Equation (5.4) can be written as a weighted combination of the
empirical distribution and the uniform prior:

$$f(z_n \mid \alpha, z_{-n}) \propto (n - 1) f_{\mathrm{edf}}(z_n) + \alpha \frac{1}{K}. \tag{5.5}$$

Equation (5.5) shows the smoothing behavior of the Dirichlet mixture when the class proportions
($\pi$) are integrated out. Specifically, the class weights for the nth observation are smoothed toward
the average value of 1/K. The Dirichlet precision $\alpha$ controls the degree of smoothing. It has the effect
of adding $\alpha$ prior observations spread evenly across all K classes. Because the class memberships
are fully exchangeable, this equation expresses the conditional distribution for any $z_i$ based on the
other class memberships by treating $x_i$ as the last observation.
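
The smoothed predictive rule of Equation (5.4) is simple enough to simulate one customer at a time. The following sketch, with hypothetical function names and classes indexed from 0, computes the dish probabilities for the next customer and uses them to generate a sequence of class memberships with the class proportions integrated out.

```python
import random
from collections import Counter

def dm_predictive(counts, n_prev, alpha, K):
    """Equation (5.4): dish probabilities for the next customer.

    counts maps dish index -> number of the n_prev earlier customers eating it.
    """
    return [(counts.get(k, 0) + alpha / K) / (n_prev + alpha) for k in range(K)]

def simulate_finite_crp(n, alpha, K, rng):
    """Sequentially sample n class memberships from the finite Chinese restaurant."""
    counts = Counter()
    z = []
    for n_prev in range(n):
        probs = dm_predictive(counts, n_prev, alpha, K)
        k = rng.choices(range(K), weights=probs, k=1)[0]
        counts[k] += 1
        z.append(k)
    return z

# Three customers already eat dish 0 and one eats dish 1; alpha = 2, K = 4.
print(dm_predictive({0: 3, 1: 1}, n_prev=4, alpha=2.0, K=4))
print(simulate_finite_crp(n=20, alpha=2.0, K=4, rng=random.Random(0)))
```

In the first call the unobserved dishes 2 and 3 each keep probability $(\alpha/K)/(n - 1 + \alpha) = 0.5/6$, which is the smoothing toward 1/K described above.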

Recall that the customers’ dishes represent their class memberships. To completely specify the
mixture distribution, each dish is associated with a parameter value, $\theta_k \sim H$. In other words, each
customer that eats dish k represents an observation from the kth class with class parameter $\theta_k$. The
class parameters are mutually independent and independent of the latent class memberships.

An important property of the DM model is that $z_i$ is bounded by K. At most, K classes will be
used to represent the n observations in the sample. The next section explores the behavior of this
model when the bound is removed.


5.3 The Dirichlet Process Mixture Model 

Consider the issue of deciding how many classes are needed to represent a given sample. One 
method is to fit latent class models for several values of K and use diagnostics to compare the fits. Such
methods include, among others, cross-validation techniques (Hastie et al., 2009) and penalized likelihood
scores such as the Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information
criterion (BIC) (Schwarz, 1978). Though AIC and BIC are popular choices, their validity for latent 
class models has been criticized (McLachlan and Peel, 2000). Instead of choosing the single best 
model complexity, one can use a prior distribution over the number of classes to calculate poste- 
rior probabilities (Roeder and Wasserman, 1997). Given a suitable prior, reversible jump Markov 
chain Monte Carlo techniques can sample from a posterior distribution which includes models of 
varying dimensionality (Green, 1995; Giudici and Green, 1999). An alternate strategy is to assume 
that the number of latent classes in the population is unbounded. The Dirichlet process mixture 
model (DPM) arises as the limiting distribution of Dirichlet mixture models when K approaches 
infinity. This limit uses a Dirichlet process as the prior for class proportions and parameters. This
section reviews properties of the Dirichlet process and its relationship to the finite Dirichlet mixture
model. These properties elucidate the hierarchical Dirichlet process (Section 5.5), which uses
multiple Dirichlet process priors to construct a nonparametric mixed membership model. 

5.3.1 The Dirichlet Process 

The Dirichlet process is a much-publicized nonparametric process formally introduced by Ferguson
(1973). It is a prior law over probability distributions whose finite-dimensional marginals have
Dirichlet distributions. Dirichlet processes have been used for modeling Gaussian mixtures when
the number of components is unknown (Escobar and West, 1995; MacEachern and Muller, 1998;
Rasmussen, 1999), survival analysis (Kim, 2003), hidden Markov models with infinite state-spaces
(Beal et al., 2001), and evolutionary clustering in which both data and clusters come and go as time
progresses (Xu et al., 2008).

The classical definition of a Dirichlet process constructs a random measure P in terms of finite-dimensional
Dirichlet distributions (Ferguson, 1973). Let $\alpha$ be a positive scalar and let H be a
probability measure with support $\Theta$. If $P \sim \mathrm{DP}(\alpha, H)$ is a Dirichlet process with precision $\alpha$ and
base measure H, then for any natural number K,

$$(P(A_1), \ldots, P(A_K)) \sim \mathrm{Dir}(\alpha H(A_1), \ldots, \alpha H(A_K)), \tag{5.6}$$

whenever $(A_1, \ldots, A_K)$ is a measurable finite partition of $\Theta$.

Sethuraman (1994) provides a constructive definition of P based on an infinite series of independent
beta random variables. Let $\phi_k \sim \mathrm{Beta}(1, \alpha)$ and $\theta_k \sim H$ be independent sequences.
Define $\pi_1 = \phi_1$, and set $\pi_k = \phi_k \prod_{j=1}^{k-1} (1 - \phi_j)$ for $k > 1$. The random measure $P = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$
has distribution $\mathrm{DP}(\alpha, H)$, where $\delta_x$ is the degenerate distribution with $f(x) = 1$. This definition
of the Dirichlet process is called a stick-breaking process. Imagine a stick of unit length which is
divided into an infinite number of pieces. The first step breaks off a piece of length $\pi_1 = \phi_1$. After
k − 1 steps, the remaining length of the stick is $\prod_{j=1}^{k-1} (1 - \phi_j)$. The kth step breaks off a fraction
$\phi_k$ of this length, which results in a new piece of length $\pi_k$.

The stick-breaking representation shows that $P \sim \mathrm{DP}(\alpha, H)$ is discrete with probability 1.
The measure P is revealed to be a mixture of an infinite number of point masses. Hence, there is
a positive probability that a finite sample from P will contain repeated values. This leads to the
clustering behavior of the following Dirichlet process mixture model (Antoniak, 1974):

$$P \sim \mathrm{DP}(\alpha, H),$$
$$\theta_i^* \mid P \sim P, \quad i = 1, \ldots, n,$$
$$x_i \mid \theta_i^* \sim F(\cdot \mid \theta_i^*), \quad i = 1, \ldots, n.$$

Let $(\theta_1, \ldots, \theta_K)$ denote the unique values of the sequence $(\theta_1^*, \ldots, \theta_n^*)$, where K is the random
number of unique values. Set $z_i$ such that $\theta_i^* = \theta_{z_i}$. Given $z_i$ and $\theta$, the distribution of the ith
observation is

$$F(x_i \mid z_i, \theta) = F(\cdot \mid \theta_{z_i}). \tag{5.7}$$

By comparing Equation (5.7) to the Dirichlet mixture model, one can interpret $z_i$ as a class membership,
$\theta$ as the class parameters, and K as the number of classes represented in the sample. Note that
K is random in this model, whereas it is a constant in the DM model. Therefore, the DPM model
provides an implicit prior over the number of classes in the sample. Antoniak (1974) specifies this
prior explicitly.

To make direct comparisons between Dirichlet process mixtures and Dirichlet mixtures, it is
useful to disentangle the distributions of $\pi$ and $\theta_k$. Let $\pi \sim \mathrm{SBP}(\alpha)$ denote the vector of weights
based on the stick-breaking process. Extend the notation $\mathrm{Mult}(\pi)$ to include infinite multinomial
distributions such that $P(z_i = k) = \pi_k$ for all positive integers k. The DPM model is equivalent to
the following hierarchy:

Dirichlet Process Mixture Model

$$\pi \mid \alpha \sim \mathrm{SBP}(\alpha),$$
$$\theta_k \mid \lambda \sim H(\lambda), \quad k = 1, 2, \ldots,$$
$$z_i \mid \pi \sim \mathrm{Mult}(\pi), \quad i = 1, \ldots, n,$$
$$x_i \mid z_i, \theta \sim F(\cdot \mid \theta_{z_i}), \quad i = 1, \ldots, n.$$



The above hierarchy directly shows the relationship between the Dirichlet mixture model and the DP
mixture model. Where the Dirichlet mixture uses a symmetric Dirichlet prior for $\pi$, the DP mixture
uses the stick-breaking process to generate an infinite sequence of class weights. In fact, Ishwaran
and Zarepour (2002) show that the marginal distribution induced on $x_1, \ldots, x_n$ by the DM model
approaches that of the DPM model as the number of classes increases to infinity. Thus, the Dirichlet
process mixture model may be interpreted as the infinite limit of finite Dirichlet mixture models.
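
The stick-breaking weights used in this hierarchy can be sketched in a few lines. Since an infinite sequence cannot be stored, the code below truncates after a fixed number of pieces and reports the length of stick left over. The truncation and the function name are choices made here for illustration and are not part of the chapter's definition.

```python
import random

def stick_breaking(alpha, num_pieces, rng):
    """Return the first num_pieces stick-breaking weights and the unbroken remainder.

    Each step draws phi ~ Beta(1, alpha) and breaks that fraction off the
    remaining stick, so the kth weight is phi_k * prod_{j<k} (1 - phi_j).
    """
    weights = []
    remaining = 1.0
    for _ in range(num_pieces):
        phi = rng.betavariate(1.0, alpha)
        weights.append(phi * remaining)
        remaining *= 1.0 - phi
    return weights, remaining

rng = random.Random(0)
for alpha in (0.5, 5.0):
    weights, remaining = stick_breaking(alpha, num_pieces=10, rng=rng)
    print(f"alpha={alpha}: first weights {[round(w, 3) for w in weights[:4]]}, "
          f"left over after 10 pieces {remaining:.4f}")
```

Smaller values of $\alpha$ tend to break off large pieces early, so most of the mass falls on a few classes. Larger values spread the mass over many classes and leave more of the stick unbroken at any fixed truncation.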


5.3.2 Chinese Restaurant Process 

A Chinese restaurant process illustrates the clustering behavior of a Dirichlet process mixture model
when the unknown class proportions ($\pi$) are integrated out (Aldous, 1985). Customers arrive and
choose tables as in the finite version for Dirichlet mixtures, but the menu in the full Chinese restaurant
process contains an infinite number of dishes.

Recall that in the finite Chinese restaurant process, a customer who sits at an empty table chooses 
one of the K available dishes using uniform probabilities. For a DP mixture, the menu has an 
unlimited number of dishes. The discrete uniform distribution is not defined over infinite sets, but 
this technicality can be sidestepped. Class parameters are assigned independently of each other and 
the enumeration of the dishes is immaterial. Therefore, dishes do not need to be labeled until after 
they are sampled. Whatever dish happens to be selected first can be labeled 1, the second dish to 
be chosen can be labeled 2, and so on. In other words, when sampling a dish, there is no need to 
distinguish between any of the unsampled dishes. Because there are finitely many sampled dishes 
and infinitely many unsampled dishes, a “uniform” distribution implies that, with probability 1, 
the customer selects a new dish from the distribution $H(\lambda)$. Note that if the distribution H has any
points with strictly positive probability, there is a chance that the “new” dish chosen by the customer 
will be the same as an already observed dish. To avoid this technicality and simplify wording, one 
may assume that H is continuous. The mathematics are the same in either case. 

In the Chinese restaurant process, the first customer sits at the first table and chooses a
random dish, which is labeled 1. The second customer joins the first table with probability $1/(1 + \alpha)$
or starts a new table with probability $\alpha/(1 + \alpha)$. As subsequent customers enter, the probability that
they join an occupied table is proportional to the number of people already seated there. Alternatively,
they may choose a new table with probability proportional to $\alpha$.

Suppose that there are T occupied tables when the nth customer enters. The probability distribution
for the nth customer’s table, $t_n$, is the same as in the finite Chinese restaurant process
(Equation 5.3). In contrast, the distribution for his dish, $z_n$, is slightly different because each table
has a unique dish. Let K be the current number of unique dishes:

$$f(z_n = k \mid \alpha, z_{-n}) = \begin{cases} \dfrac{\sum_{i<n} 1(z_i = k)}{n - 1 + \alpha}, & k \le K, \\[1ex] \dfrac{\alpha}{n - 1 + \alpha}, & k = K + 1. \end{cases} \tag{5.8}$$

Again, $z_{-n}$ denotes the vector of dishes (class assignments) for all but the nth customer.

To finish specifying the mixture, each dish is associated with a class parameter drawn independently
from the base measure H. As in the finite Chinese restaurant process, the class parameter for
the nth observation can also be written as a weighted combination of the current empirical distribution
and the prior distribution:

$$F(\theta_n) \propto (n - 1) f_{\mathrm{edf}} + \alpha H, \tag{5.9}$$

where $f_{\mathrm{edf}} = \sum_{i<n} \delta_{\theta_i} / (n - 1)$ denotes the empirical distribution of the first n − 1 class memberships.
Note that the only difference from the finite Chinese restaurant is that Equation (5.9) uses
the prior distribution H in place of the prior uniform probability 1/K of Equation (5.5).

The Chinese restaurant process illustrates how the population (restaurant) has an infinite number
of classes (dishes), but only a finite number are represented (ordered by a customer). Note that each
customer selects a random table with probabilities that depend on the precision $\alpha$, but not the base
measure H. Hence, the choice of $\alpha$ amounts to an implicit prior on the number of classes. Antoniak
(1974) specifies this prior explicitly. Notably, the number of classes increases stochastically with
both n and $\alpha$. In the limit, as $\alpha$ approaches infinity, each customer chooses an unoccupied table. As
a result, there is no clustering and each observation belongs to its own unique class. The distribution
of $(\theta_{z_1}, \ldots, \theta_{z_n})$ approaches an i.i.d. sample from H. In the other extreme, as $\alpha$ approaches zero,
each customer chooses the first table, resulting in a single class. In effect, the population distribution
is no longer a mixture distribution.
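
The dependence of the number of classes on n and $\alpha$ can be seen by simulating Equation (5.8) directly. The sketch below uses a hypothetical function name and labels dishes from 0 in order of first appearance. It averages the number of distinct dishes over repeated runs for a small and a large precision.

```python
import random

def simulate_crp(n, alpha, rng):
    """Sequentially assign n customers to dishes using Equation (5.8)."""
    counts = []  # counts[k] = number of customers eating dish k
    z = []
    for _ in range(n):
        # Existing dishes are weighted by their counts; a new dish has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(counts):
            counts.append(1)   # a previously unobserved dish is ordered
        else:
            counts[k] += 1
        z.append(k)
    return z

rng = random.Random(0)
for alpha in (0.1, 1.0, 10.0):
    runs = [len(set(simulate_crp(100, alpha, rng))) for _ in range(200)]
    print(f"alpha={alpha}: average number of classes among 100 customers = "
          f"{sum(runs) / len(runs):.2f}")
```

The normalizing constant $n - 1 + \alpha$ does not appear explicitly because `random.choices` normalizes the weights itself.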

5.3.3 Comparison of Dirichlet Mixtures and Dirichlet Process Mixtures 

Both the DM model and the DPM model assume that observed data are representatives of a finite
number of latent classes. The chief difference is that the DM model places a bound on the number
of classes while the DPM model does not. Ishwaran and Zarepour (2002) make this relationship
explicit: as the number of classes increases to infinity, the DM model converges in distribution to
the DPM model. Because the number of classes is unbounded, there is always a positive probability
that the next response represents a previously unobserved class. The DPM model is an example
of a nonparametric Bayesian model, which allows model complexity to increase as more data are
observed.
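
One way to get a feel for this convergence is to compare the expected number of classes represented among n observations under the two models. The formulas in the sketch below are not taken from this chapter. They follow from applying the predictive rules (5.4) and (5.8) sequentially: under the DM model a given class stays unused with probability $\prod_{i=0}^{n-1} (\alpha - \alpha/K + i)/(\alpha + i)$, and under the DPM model the ith customer orders a new dish with probability $\alpha/(\alpha + i - 1)$. The function names are illustrative.

```python
def expected_classes_dm(n, alpha, K):
    """Expected number of occupied classes among n observations in a K-class DM model."""
    p_unused = 1.0
    for i in range(n):
        # Probability that observation i+1 also avoids a class unused so far.
        p_unused *= (alpha * (1.0 - 1.0 / K) + i) / (alpha + i)
    return K * (1.0 - p_unused)

def expected_classes_dpm(n, alpha):
    """Expected number of occupied classes among n observations in a DPM model."""
    return sum(alpha / (alpha + i) for i in range(n))

n, alpha = 50, 1.0
for K in (2, 10, 100, 10_000):
    print(f"K={K}: DM expectation = {expected_classes_dm(n, alpha, K):.4f}")
print(f"DPM expectation = {expected_classes_dpm(n, alpha):.4f}")
```

The DM expectation is capped at K for small K and climbs toward the DPM value as K grows, which is consistent with the limiting interpretation above. It checks only one summary of the distribution and is not a proof of the convergence result.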

A comparison of Equations (5.8) and (5.4) reveals how the nonparametric DPM model differs
from the bounded-complexity DM model. In both models, the distribution of an observation’s
class is simply the empirical distribution of the previous class memberships, plus additional $\alpha$ prior
observations. However, the prior weight is distributed differently. In the DM model, the $\alpha$ prior
observations are placed uniformly over the K classes. Once all K classes have been observed, there
is no chance of observing a novel class. In the DPM model, the $\alpha$ prior observations are placed on
the next unoccupied table, which will serve a new dish with probability 1 (if the base measure H
is continuous). Hence, there is always a non-zero probability that the next observation belongs to a
previously unobserved class, though this probability decreases as the sample size increases. While
the DPM model allows greater flexibility in clustering, both models yield the same distribution for
the observations and class parameters when conditioned on the vector of class memberships.
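
The different placement of the $\alpha$ prior observations can be made concrete by computing the probability that the next observation falls in a previously unobserved class. Both functions below read this probability off Equations (5.4) and (5.8); their names and argument conventions are illustrative.

```python
def prob_new_class_dm(n_prev, num_observed, alpha, K):
    """DM model: each of the K - num_observed unseen classes carries prior mass alpha/K."""
    return (K - num_observed) * (alpha / K) / (n_prev + alpha)

def prob_new_class_dpm(n_prev, alpha):
    """DPM model: the whole prior mass alpha sits on a new class."""
    return alpha / (n_prev + alpha)

# Ten earlier observations occupying three classes, with alpha = 1.
for K in (3, 5, 50, 5000):
    print(f"DM with K={K}: {prob_new_class_dm(10, 3, 1.0, K):.4f}")
print(f"DPM: {prob_new_class_dpm(10, 1.0):.4f}")
```

With K = 3 every class has already been observed, so the DM probability is exactly zero. As K grows, the DM probability approaches the DPM value $\alpha/(n - 1 + \alpha)$.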

DP mixtures, and other nonparametric Bayesian models, are one strategy for determining the 
appropriate model complexity given a set of data. The theory behind these mixtures states that there 
is the possibility that some classes have not been encountered yet. For prediction, as opposed to 
estimation, this flexibility may be especially attractive since the new observation may not fit well 
into any of the current classes. 

Recall that the precision $\alpha$ amounts to a prior over the number of classes. The posterior distribution
of the latent class memberships provides a way to learn about the complexity from the
observations that does not require choosing a specific value. Furthermore, it is possible to expand
the DPM model to include a hyperprior for $\alpha$ (Escobar and West, 1995).


5.4 Mixed Membership Models

In the mixture models of Sections 5.2 and 5.3, each individual is assumed to belong to one of K 
underlying classes in the population. In a mixed membership model, each individual may belong 
to multiple classes with varying degrees of membership. In other words, each observation is asso- 
ciated with a mixture of the K classes. For this reason, mixed membership models are also called 
individual-level mixture models. Mixed membership models have been used for survey analysis 
(Erosheva, 2003; Erosheva et al., 2007), language models (Blei et al., 2003; Erosheva et al., 2004), 
and analysis of social and protein networks (Airoldi et al., 2008). This section focuses on models 
where K is a finite constant, which bounds the number of classes that may be observed. 

Consider a population with K classes. Let $H(\lambda)$ denote the prior over class parameters where
$\lambda$ is a hyperparameter. In the DM model, individual i has membership in a single class, denoted
$z_i$. Alternatively, the ith individual’s class may be represented as the K-dimensional vector
$\pi_i = (\pi_{i1}, \ldots, \pi_{iK})$, where $\pi_{ik}$ is 1 if $z_i = k$ and 0 otherwise. By contrast, a mixed membership model
allows $\pi_i$ to be any non-negative vector whose elements sum to 1. The range of $\pi_i$ is called the
(K − 1)-simplex. Geometrically, the simplex is a hyper-tetrahedron in $\mathbb{R}^K$. A mixed membership
model allows $\pi_i$ to take any value in the simplex while the DM model constrains $\pi_i$ to be one of the
K vertices.

The Grade of Membership model (GoM) extends the Dirichlet mixture model to allow for mixed 
membership (Erosheva et al., 2007). Both models can be understood as mixtures of the K possible 
classes. The DM model has a single population-level mixture for all individuals. In the GoM model, 
the population-level mixture provides typical values for the class weights, but the actual weights 
vary between individuals. As with the DM model, the population-level mixture in the GoM model 
has a symmetric Dirichlet prior. This mixture serves as the expected value for the individual-level 
mixtures, which also have a symmetric Dirichlet distribution. Denote the Dirichlet precision at the
population level by $\alpha_0$ and the precision at the individual level by $\alpha$. The GoM model can be
expressed by the following hierarchy:

$$\pi_0 \mid \alpha_0 \sim \mathrm{Dir}(\alpha_0/K),$$
$$\theta_k \mid \lambda \sim H(\lambda), \quad k = 1, \ldots, K,$$
$$\pi_i \mid \alpha, \pi_0 \sim \mathrm{Dir}(\alpha \pi_{01}, \ldots, \alpha \pi_{0K}), \quad i = 1, \ldots, n,$$
$$x_i \mid \pi_i, \theta \sim \sum_{k=1}^{K} \pi_{ik} F(\cdot \mid \theta_k), \quad i = 1, \ldots, n.$$



This model is the same as the DM model, except for the individual-level mixture proportions. The
Dirichlet mixture model constrains $\pi_{ik}$ to zero for all but one class, so the distribution of $x_i$ is a
“mixture” of one randomly selected component. By contrast, in a mixed membership model, $\pi_i$ can
take any value in the (K − 1)-simplex resulting in a true mixture of the K components.
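
As with the Dirichlet mixture, the GoM hierarchy can be read as a sampling recipe, now with one membership vector per individual. The sketch below is illustrative only: it again assumes $H(\lambda)$ is Normal(0, $\lambda^2$) and $F(\cdot \mid \theta)$ is Normal($\theta$, 1), neither of which is specified in the text, and all function names are hypothetical. A draw from the individual-level mixture is produced by first picking a component with probabilities $\pi_i$ and then sampling from it.

```python
import random

def dirichlet(params, rng):
    """Draw from a Dirichlet with the given parameter vector via normalized gammas."""
    # The floor guards against a zero shape if a mean component underflows.
    g = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(g)
    return [v / total for v in g]

def sample_gom(n, K, alpha0, alpha, lam, rng):
    """Generate population weights, class parameters, memberships and data from a GoM model.

    Illustrative assumptions: H(lam) = Normal(0, lam^2), F(. | theta) = Normal(theta, 1).
    """
    pi0 = dirichlet([alpha0 / K] * K, rng)          # population-level mixture
    theta = [rng.gauss(0.0, lam) for _ in range(K)]  # class parameters
    pis, xs = [], []
    for _ in range(n):
        pi_i = dirichlet([alpha * p for p in pi0], rng)     # individual-level mixture
        k = rng.choices(range(K), weights=pi_i, k=1)[0]     # component for this draw
        xs.append(rng.gauss(theta[k], 1.0))
        pis.append(pi_i)
    return pi0, theta, pis, xs

pi0, theta, pis, xs = sample_gom(n=5, K=3, alpha0=6.0, alpha=10.0, lam=5.0,
                                 rng=random.Random(0))
print("population mixture:", [round(p, 3) for p in pi0])
for pi_i in pis:
    print("individual mixture:", [round(p, 3) for p in pi_i])
```

Increasing `alpha` pulls the individual mixtures toward the population mixture `pi0`, while small values push each individual toward a vertex of the simplex, which is the DM-like extreme.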

Clearly, the GoM model generalizes the K-dimensional DM model by allowing more flexibility
in individual-level mixtures. Conversely, the GoM model can also be described as a special case of a
larger DM model. Suppose each individual is measured across J different items. (For simplicity of
notation, assume J is constant for all individuals; removing this restriction is trivial.) Erosheva et al.
(2007) provide a representation theorem to express a K-class GoM model as a constrained DM
model with $K^J$ classes. Therefore, this theorem will assist in building the GoM model into a nonparametric
model in much the same way that Section 5.3 built the DM model into the nonparametric
DPM model.

Erosheva et al. describe their representation theorem in the context of survey analysis. In this
case, a sample of $n$ people each respond to a series of $J$ survey items, and an observation is the
collection of one person’s responses to all of the items. Let $\pi_i = (\pi_{i1}, \ldots, \pi_{iK})$ be the membership
vector and let $x_i = (x_{i1}, \ldots, x_{iJ})$ be the response vector for the $i$th individual. (Henceforth, the $i$th
observation shall be explicitly denoted as a vector because scalar observations do not naturally fit
with the representation theorem.) According to the GoM model, the distribution of $x_i$ is a mixture
of the $K$ class distributions with mixing proportions given by $\pi_i$. Alternatively, the GoM model can
be interpreted as a Dirichlet mixture model in which individuals can move among the classes for
each response. That is, individual $i$ may belong to class $z_{ij}$ in response to item $j$, but belong to a
different class $z_{ij'}$ in response to item $j'$. The probability that individual $i$ behaves as class $k$ for
a particular item is $\pi_{ik}$. Note that this probability depends on the individual, but is constant across
all items. Let $z_{ij}$ denote the class membership for the $i$th individual in response to the $j$th item.
The distribution of the $i$th individual’s response is determined by $z_i = (z_{i1}, \ldots, z_{iJ})$ and the class
parameters ($\theta$). Therefore, individual $i$ may be considered a member of the latent class $z_i$ with class
parameter $(\theta_{z_{i1}}, \ldots, \theta_{z_{iJ}})$. Each of the $J$ components takes on one of $K$ possible classes, making
the GoM model a constrained DM model with $K^J$ possible classes. The constraints arise because
$\pi_i$ is constant across all items in the GoM model, whereas the DM model allows all probabilities
to vary freely. Thus, the probability of class $z_i$ is constant under permutation of its elements in the
GoM model but not the DM model.
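
The permutation property can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the membership vector `pi_i`, the choice $K = 3$, $J = 3$, and the helper name `gom_class_prob` are hypothetical and do not come from Erosheva et al. It enumerates all $K^J$ item-level class vectors and computes the probability that the GoM model assigns to each.

```python
from itertools import product
import math

def gom_class_prob(z, pi_i):
    """P(z_i = z | pi_i) when each item-level class is drawn independently from pi_i."""
    return math.prod(pi_i[k] for k in z)

pi_i = [0.6, 0.3, 0.1]   # hypothetical membership vector (K = 3)
J = 3                    # hypothetical number of items

# Enumerate the K**J latent classes of the constrained DM representation.
classes = list(product(range(len(pi_i)), repeat=J))
probs = {z: gom_class_prob(z, pi_i) for z in classes}

# Reordering the elements of z leaves its probability unchanged.
print(len(classes), probs[(0, 1, 2)], probs[(2, 1, 0)])
```

An unconstrained DM model over the same $K^J$ classes could give `(0, 1, 2)` and `(2, 1, 0)` different weights; the GoM constraint forces them to be equal.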

The representation theorem suggests augmenting the GoM model with the collection of latent 
individual-per-item class memberships: 


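Grade of Membership Model

$$
\begin{aligned}
\pi_0 \mid \alpha_0 &\sim \mathrm{Dir}(\alpha_0/K), \\
\theta_k \mid \lambda &\sim H(\lambda), & k &= 1, \ldots, K, \\
\pi_i \mid \alpha, \pi_0 &\sim \mathrm{Dir}(\alpha\pi_{01}, \ldots, \alpha\pi_{0K}), & i &= 1, \ldots, n, \\
z_{ij} \mid \pi_i &\sim \mathrm{Mult}(\pi_i), & i &= 1, \ldots, n, \; j = 1, \ldots, J, \\
x_{ij} \mid z_{ij}, \theta &\sim F(\cdot \mid \theta_{z_{ij}}), & i &= 1, \ldots, n, \; j = 1, \ldots, J.
\end{aligned}
$$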

As with the DM model, each measurement ($x_{ij}$) is generated by randomly choosing a latent class
membership, then sampling from the class-specific distribution. In both models, responses are
assumed to be independent given the class memberships. The responses in the DM model are fully
exchangeable because each one uses the same vector of class weights. The GoM model includes
individual-level mixtures that allow for the individual’s class weights to vary from the population
average. Therefore, responses are exchangeable only if they belong to the same individual. Note
that $z_{ij}$ is a positive integer less than or equal to $K$. The next section builds this latent class
representation into a nonparametric model by removing this bound on the value of $z_{ij}$.
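
The generative hierarchy above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: it assumes binary responses with $F(\cdot \mid \theta_k)$ a Bernoulli distribution and $H$ a Beta(1, 1) base measure, and the names `dirichlet` and `simulate_gom` are hypothetical.

```python
import random

def dirichlet(rng, conc):
    """Draw from a Dirichlet distribution by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in conc]
    total = sum(g)
    return [v / total for v in g]

def simulate_gom(n, J, K, alpha0, alpha, seed=0):
    """Simulate the latent class form of the GoM model (Bernoulli F, Beta(1,1) H assumed)."""
    rng = random.Random(seed)
    pi0 = dirichlet(rng, [alpha0 / K] * K)                  # population-level mixture
    theta = [rng.betavariate(1.0, 1.0) for _ in range(K)]   # class parameters from H
    pi, z, x = [], [], []
    for _ in range(n):
        pi_i = dirichlet(rng, [alpha * p for p in pi0])     # individual-level mixture
        z_i = rng.choices(range(K), weights=pi_i, k=J)      # one class per item
        x_i = [1 if rng.random() < theta[k] else 0 for k in z_i]
        pi.append(pi_i)
        z.append(z_i)
        x.append(x_i)
    return pi0, theta, pi, z, x

pi0, theta, pi, z, x = simulate_gom(n=4, J=6, K=3, alpha0=3.0, alpha=5.0)
```

Every `z[i][j]` lies in `0..K-1`, which is the bound that the nonparametric extension removes.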


5.5 The Hierarchical Dirichlet Process Mixture Model 

Table 5.1 illustrates two analogies that may help elucidate the hierarchical Dirichlet process mixture 
model (HDPM). Comparing the columns reveals that the relationship between the HDPM model and 
the DPM model is similar to the relationship between the GoM model and the DM model. Recall 
that the GoM model introduces mixed memberships to the DM model by introducing priors for 
individual-level mixtures that allow them to vary from the overall population mixture. In the same 
way, the HDPM adds mixed memberships to the DPM model through individual-level priors. In 
both the GoM and HDPM models, the population-level mixture provides the expected value for
the individual-level mixtures. Comparing the rows of Table 5.1 shows that the relationship between
the HDPM and GoM models is similar to the relationship between the DPM and DM models. 
In both cases, the former model is a nonparametric version of the latter model that arises as a 
limiting distribution when the number of classes is unbounded. The HDPM and DPM models are 
specified mathematically by replacing the symmetric Dirichlet priors of the GoM and DM models 
with Dirichlet process priors. 



                   Number of Classes
Exchangeability    Bounded    Unbounded
Full               DM         DPM
Partial            GoM        HDPM

TABLE 5.1
The relationship among the four main models of this chapter.

The hierarchical Dirichlet process mixture model (HDPM) incorporates a Dirichlet process for
each individual, $P_i \sim \mathrm{DP}(\alpha, P_0)$, where the base measure $P_0$ is itself drawn from a Dirichlet
process, $P_0 \sim \mathrm{DP}(\alpha_0, H)$. Thus, the model is parametrized by a top-level base measure, $H$, and two
precision parameters, $\alpha_0$ and $\alpha$.


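$$
\begin{aligned}
P_0 \mid \alpha_0, H &\sim \mathrm{DP}(\alpha_0, H), \\
P_i \mid \alpha, P_0 &\sim \mathrm{DP}(\alpha, P_0), & i &= 1, \ldots, n, \\
\theta_{ij} \mid P_i &\sim P_i, & i &= 1, \ldots, n, \; j = 1, \ldots, J, \\
x_{ij} \mid \theta_{ij} &\sim F(\cdot \mid \theta_{ij}), & i &= 1, \ldots, n, \; j = 1, \ldots, J.
\end{aligned}
$$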


Note that $E[P_i] = P_0$. Thus the population-level mixture, $P_0$, provides the expected value for the
individual-level mixtures, and the precision $\alpha$ influences how closely the $P_i$s fall to this mean.

A stick-breaking representation of the HDPM model allows it to be expressed as a latent class
model. Since $P_0$ has a Dirichlet process prior, it can be written as a random stick-breaking measure
$P_0 = \sum_{k=1}^{\infty} \pi_{0k}\delta_{\theta_{0k}}$, where $\pi_0 \sim \mathrm{SBP}(\alpha_0)$ and $\theta_0$ is an infinite i.i.d. sample with distribution $H$.
Likewise, each individual mixture $P_i$ can be expressed as $P_i = \sum_{k=1}^{\infty} \pi^*_{ik}\delta_{\theta^*_{ik}}$, where $\pi^*_i \sim \mathrm{SBP}(\alpha)$
and $\theta^*_i$ is an infinite i.i.d. sample from $P_0$. Because $\theta^*_{ik} \sim P_0$, it follows that each $\theta^*_{ik} \in \theta_0$. Therefore,
$P_i = \sum_{k=1}^{\infty} \pi_{ik}\delta_{\theta_{0k}}$, where $\pi_{ik} = \sum_{j=1}^{\infty} \pi^*_{ij} 1(\theta^*_{ij} = \theta_{0k})$. $P_0$ specifies the set of possible class
parameters and the expected class proportions; $P_i$ allows individual variability in class proportions.
Since the class parameters are the same for all individuals, the notation $\theta_{0k}$ may be replaced by the
simpler $\theta_k$. While $\pi_0$ may be generated using the same stick-breaking procedure used in Section 5.3,
the individual-level $\pi_i$s require a different procedure given by Teh et al. (2006). Given $\pi_0$ and $\theta$, let
$\phi_{ik} \sim \mathrm{Beta}\left(\alpha\pi_{0k}, \alpha\left(1 - \sum_{j=1}^{k} \pi_{0j}\right)\right)$. Define $\pi_{i1} = \phi_{i1}$, and set $\pi_{ik} = \phi_{ik} \prod_{j=1}^{k-1} (1 - \phi_{ij})$ for
$k > 1$. The random measure $P_i = \sum_{k=1}^{\infty} \pi_{ik}\delta_{\theta_k}$ has distribution $\mathrm{DP}(\alpha, P_0)$. Denote the conditional
distribution of $\pi_i \mid \pi_0$ by $\mathrm{SBP2}(\alpha, \pi_0)$. The latent class representation of the HDPM model is as
follows:


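Hierarchical DP Mixture Model

$$
\begin{aligned}
\pi_0 \mid \alpha_0 &\sim \mathrm{SBP}(\alpha_0), \\
\theta_k \mid \lambda &\sim H(\lambda), & k &= 1, 2, \ldots, \\
\pi_i \mid \alpha, \pi_0 &\sim \mathrm{SBP2}(\alpha, \pi_0), & i &= 1, \ldots, n, \\
z_{ij} \mid \pi_i &\sim \mathrm{Mult}(\pi_i), & i &= 1, \ldots, n, \; j = 1, \ldots, J, \\
x_{ij} \mid z_{ij}, \theta &\sim F(\cdot \mid \theta_{z_{ij}}), & i &= 1, \ldots, n, \; j = 1, \ldots, J.
\end{aligned}
$$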

The three lowest levels in the HDPM model (pertaining to $\pi_i$, $z_{ij}$, and $x_{ij}$) represent the
item-level distribution. Individual $i$, in response to item $j$, chooses a class according to its unique
mixture: $z_{ij} \sim \mathrm{Mult}(\pi_i)$. Its response is then given according to the distribution for that class:
$x_{ij} \sim F(\cdot \mid \theta_{z_{ij}})$. This behavior is the same as the Grade of Membership model. Each individual is
associated with a unique mixture over a common set of classes. The individual’s mixture defines
the probability that individual $i$ behaves as class $k$ in response to item $j$. Since this probability does
not depend on $j$, two responses are exchangeable if and only if they come from the same individual.
Unlike the GoM model, the HDPM model does not bound the number of classes a priori.
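
The two stick-breaking procedures can be sketched as follows. This is an approximation under stated assumptions rather than the exact infinite construction: the weights are truncated at a finite number of sticks (the last stick absorbs the remaining mass), SBP is taken to use Beta(1, $\alpha_0$) breaks as in the usual Dirichlet process construction, and the function names `sbp` and `sbp2` are hypothetical.

```python
import random

def sbp(rng, alpha0, trunc):
    """Truncated draw of pi_0 ~ SBP(alpha0); Beta(1, alpha0) breaks are assumed."""
    weights, remaining = [], 1.0
    for k in range(trunc):
        # The final stick takes all remaining mass so the weights sum to one.
        phi = 1.0 if k == trunc - 1 else rng.betavariate(1.0, alpha0)
        weights.append(phi * remaining)
        remaining *= 1.0 - phi
    return weights

def sbp2(rng, alpha, pi0, eps=1e-12):
    """Draw pi_i | pi_0 using phi_ik ~ Beta(alpha*pi_0k, alpha*(1 - sum_{j<=k} pi_0j))."""
    weights, remaining, tail = [], 1.0, 1.0
    for k, p in enumerate(pi0):
        tail -= p  # tail = 1 - sum_{j<=k} pi_0j
        if k == len(pi0) - 1:
            phi = 1.0
        else:
            # eps guards against non-positive beta parameters caused by rounding.
            phi = rng.betavariate(max(alpha * p, eps), max(alpha * tail, eps))
        weights.append(phi * remaining)
        remaining *= 1.0 - phi
    return weights

rng = random.Random(0)
pi0 = sbp(rng, alpha0=2.0, trunc=20)
pi_i = sbp2(rng, alpha=5.0, pi0=pi0)
```

Larger values of `alpha` keep `pi_i` closer to `pi0`, in line with the role of the precision described above.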

5.5.1 Chinese Restaurant Franchise 

Teh et al. (2006) use the analogy of a Chinese restaurant franchise to describe the clustering
behavior of the hierarchical Dirichlet process. As each individual has a unique class mixture, each is
represented by a distinct restaurant. Each of the customers represents one of the individual’s
features. For example, in the context of survey analysis, a customer is the individual’s response to one
of the survey items. The restaurants share a common menu with an infinite number of dishes to
represent the various classes in the population. Each restaurant operates as an independent Chinese
restaurant process with respect to seating arrangement, but dishes for each table are chosen by a
different method which depends on the entire collection of restaurants.

Let $x_{ij}$ denote the $j$th customer at the $i$th restaurant. When the customer enters, he chooses a
previous table based on how many people are sitting there, or else starts a new table. The distribution
for the table choice is the same as the Chinese restaurant process. It is given by Equation (5.3),
taking $t_n$ to denote the new customer’s table and $t_1, \ldots, t_{n-1}$ to denote the previous customers’
tables, where the numbering is confined to tables at restaurant $i$.

If the customer sits at a new table, he must select a dish. As with table choice, the customer will
choose a previously selected dish with a probability that depends on how popular it is. Specifically,
the probability is proportional to the number of other tables currently serving the dish across the
entire franchise. Alternatively, with probability proportional to $\alpha_0$, the customer will choose a new
dish.

Suppose there are $T$ tables currently occupied in the entire franchise when a customer decides
to sit at a new table, becoming the first person at table $T + 1$. Denote the dish served at table $t$ by
$d_t$ and let $K$ denote the current count of unique dishes. If a new table is started, the distribution for
the next dish, $d_{T+1}$, is

$$
P(d_{T+1} = k \mid d_1, \ldots, d_T) =
\begin{cases}
\dfrac{\sum_{t=1}^{T} 1(d_t = k)}{T + \alpha_0}, & k \le K, \\[2ex]
\dfrac{\alpha_0}{T + \alpha_0}, & k = K + 1.
\end{cases}
\qquad (5.10)
$$

Note that the customer has three choices: he may join an already occupied table (i.e., choose
locally from the dishes already being served at restaurant $i$); start a new table and choose a previous
dish (i.e., choose a dish from the global menu); or start a new table and select a new dish. Let $z_{iJ}$
denote the dish chosen by the $J$th customer at restaurant $i$. The Chinese restaurant franchise shows
that the distribution of $z_{iJ}$ is comprised of three components:

$$
P(z_{iJ} = k \mid \alpha_0, \alpha, z_{iJ}^{-}) =
\begin{cases}
\dfrac{\sum_{j=1}^{J-1} 1(z_{ij} = k)}{J - 1 + \alpha} + \dfrac{\alpha}{J - 1 + \alpha} \cdot \dfrac{\sum_{t=1}^{T} 1(d_t = k)}{T + \alpha_0}, & k \le K, \\[2ex]
\dfrac{\alpha}{J - 1 + \alpha} \cdot \dfrac{\alpha_0}{T + \alpha_0}, & k = K + 1,
\end{cases}
\qquad (5.11)
$$


where $T$ is the current number of occupied tables, $K$ is the current number of distinct dishes across
the entire franchise, and $z_{iJ}^{-}$ is the set of all dish assignments except for $z_{iJ}$. Note that the weight
for a dish $k \le K$ is proportional to the number of customers at restaurant $i$ eating that dish plus
$\alpha/(T + \alpha_0)$ times the number of tables in the franchise serving that dish. In other words, dishes are
chosen according to popularity, but popularity within restaurant $i$ is weighted more heavily. In practical
terms, measurements from the same individual share information more strongly than measurements
from multiple individuals. The precision $\alpha$ specifies the relative importance of the local-level and
global-level mixtures.
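
Equation (5.11) translates directly into a small function. The sketch below is an illustration only; the function name `crf_dish_probs` and its argument names are hypothetical. It takes the dish counts within restaurant $i$ and the table counts across the franchise and returns the $K + 1$ probabilities, the last being the probability of a new dish.

```python
def crf_dish_probs(local_counts, table_counts, alpha, alpha0):
    """Predictive dish probabilities for the next customer at one restaurant.

    local_counts[k]: customers at this restaurant currently eating dish k
    table_counts[k]: tables across the whole franchise serving dish k
    Returns K + 1 probabilities; the final entry is for a brand-new dish.
    """
    n_local = sum(local_counts)              # J - 1 previous customers
    n_tables = sum(table_counts)             # T occupied tables
    new_table = alpha / (n_local + alpha)    # probability of starting a new table
    probs = [
        n / (n_local + alpha) + new_table * m / (n_tables + alpha0)
        for n, m in zip(local_counts, table_counts)
    ]
    probs.append(new_table * alpha0 / (n_tables + alpha0))
    return probs

probs = crf_dish_probs([3, 0, 1], [4, 2, 1], alpha=1.0, alpha0=2.0)
```

In this example the second dish has no local customers, so its probability comes entirely from the franchise-level term.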

To finish specifying the HDPM model, each dish is associated with a class parameter drawn
independently from the base measure $H$. The distribution of $\theta_{z_{iJ}}$ is a three-part mixture. Let
$F_{\mathrm{Ind}} = \sum_{j<J} \delta_{\theta_{z_{ij}}} / (J - 1)$ be the empirical distribution of class parameters based on the customers at
restaurant $i$. Let $F_{\mathrm{Pop}} = \sum_{t=1}^{T} \delta_{\theta_{d_t}} / T$ be the empirical distribution based on the proportion of tables
serving each dish across the entire franchise. Multiplying the weights in Equation (5.11) by
$J - 1 + \alpha$ shows that the distribution of $\theta_{z_{iJ}}$ is

$$
F(\theta_{z_{iJ}} \mid \alpha, \alpha_0, \theta_{z}^{-}) \propto (J - 1) F_{\mathrm{Ind}} + \frac{\alpha}{T + \alpha_0} \left( T F_{\mathrm{Pop}} + \alpha_0 H \right), \qquad (5.12)
$$

where $\theta_{z}^{-}$ denotes the class parameters for all customers except for the $J$th customer of the $i$th
restaurant. As with the other three mixture models, DM, DPM, and GoM, the responses are assumed
to be independent given the class memberships. Hence, Equations (5.11) and (5.12) can be applied
to any customer by treating him as the last one.

Equation (5.12) illustrates the role of the two precision parameters. Larger values of $\alpha$ place
more emphasis on the population-level mixture, so that individual responses will tend to be closer
to the overall mean. Meanwhile, larger values of $\alpha_0$ place more emphasis on the base measure $H$.
This specifies the prior distribution of $\theta_{z_{iJ}}$. After observing the response $x_{iJ}$, the class weights are
updated by the likelihood of observing $x_{iJ}$ within each class. Unfortunately, this model requires
fairly complicated bookkeeping to track the number of customers at each table and the number of
tables serving each menu item. Blunsom et al. (2009) propose a strategy to reduce this overhead.
Inference is discussed more fully in Section 5.7.

5.5.2 Comparison of GoM and HDPM models 

The relationship between the HDPM and GoM models is very similar to the relationship between
the DM and DPM models described in Section 5.3.3. Both the GoM and HDPM models assume
that observed data are representatives of a finite number of classes. Whereas the DM and DPM 
models assume that the observations are fully exchangeable, the GoM and HDPM models are mixed 
membership models that treat some observations as more similar than others. For example, in survey 
analysis, two responses from one individual are assumed to be more alike than responses from two 
different individuals. In text analysis, the topics within one document are assumed to be more similar 
than topics contained in different documents. 

The HDPM model is similar to the GoM model as both combine individual-level and
population-level mixtures. The chief difference between the GoM and HDPM models is that the GoM model
bounds the number of classes a priori while the HDPM model does not. Teh et al. (2006) make
this relationship explicit. They show that the HDPM model is the limiting distribution of GoM models
when the number of classes approaches infinity. Because the number of classes is unbounded, there
is always a positive probability that the next response represents a previously unobserved class.

The clustering behaviors of the HDPM and GoM models are very similar. In the GoM model, the
individual-level mixtures ($\pi_i$) are shrunk toward the overall population-level mixture ($\pi_0$), which is
itself shrunk toward the uniform probability vector, $(\frac{1}{K}, \ldots, \frac{1}{K})$. The Dirichlet process priors in
the HDPM model exhibit a similar property, where the individual-level $P_i$s are shrunk toward the
overall population-level mixture, $P_0$. Whereas the GoM model shrinks $\pi_0$ toward the uniform prior
over the $K$ classes, the HDPM model shrinks $P_0$ toward a prior base measure $H$. The result is that
the HDPM model always maintains a strictly positive probability that a new observation is assigned
to a new class. This is illustrated by the Chinese restaurant franchise, with the exact probability
of a new class given by Equation (5.11). In other words, the DM and GoM models place a finite
bound on the observed number of classes, but the DPM and HDPM models are nonparametric
models that allow the number of classes to grow as new data are observed. While the HDPM allows
greater flexibility in clustering than the GoM model, both models yield the same distribution for the
observations and class parameters, when conditioned on the vector of class memberships.

Recall that the precision $\alpha$ in the DPM model amounts to a prior over the number of observed
classes. In the HDPM model, this prior is specified by two precision parameters: $\alpha_0$ at the
population level and $\alpha$ at the individual level. The posterior distribution of the latent class memberships
provides a way to learn about the complexity of the data without choosing a specific value. Teh
et al. (2006) extend the HDPM model with hyperpriors for both $\alpha_0$ and $\alpha$ to augment the model’s
ability to infer complexity from the data. Section 5.6 explores the role of $\alpha_0$ and $\alpha$ intuitively with
simulation studies for illustration.


5.6 Simulated HDPM Models 

The Chinese restaurant franchise can be used to construct simple simulations of HDPM models.
Such simulations can reveal how the model is affected by changes in the tuning parameters or
sample size. Specifically, the simulations in this section illustrate the behavior of mixtures at both
the population level and individual level, as well as the similarity between different individuals.

5.6.1 Size of the Population-Level Mixture 

Figure 5.1 shows how the number of population classes in the HDPM model is affected by sample
size and the two precision parameters. The values result from simulation of a Chinese restaurant
franchise in which each restaurant receives 16 customers (e.g., each individual responds to 16 survey
items). With $\alpha$ and $n$ held fixed, the average size (number of components) of the population mixture
increases as $\alpha_0$ increases from 1 to 100. The average mixture size also increases with $\alpha$ when $n$ and $\alpha_0$
are fixed, but the difference is not significant except when $\alpha_0 = 100$. Thus, both precisions affect
the expected number of classes, but $\alpha_0$ may limit or dampen the effect of $\alpha$. Intuitively, a large value
of $\alpha$ causes more customers to choose a new table, but a low value for $\alpha_0$ means that they frequently
choose a previously ordered dish. Hence, new classes will be encountered infrequently. Indeed, as
$\alpha_0$ approaches 0, the limit in the number of classes is 1, regardless of the value of $\alpha$. Standard errors
in mixture size were also estimated by repeating each simulation 100 times. The effect of $\alpha_0$ and $\alpha$
on the standard error is similar to the effect on the mean mixture size (see Figure 5.2).
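
A simulation of this kind can be sketched as follows. This is a minimal illustration, not the code behind Figures 5.1 and 5.2: the function name `simulate_crf` and its arguments are hypothetical, and restaurants are filled one after another, which exchangeability permits.

```python
import random

def simulate_crf(n, J, alpha, alpha0, seed=0):
    """Simulate dish (class) labels z[i][j] from a Chinese restaurant franchise."""
    rng = random.Random(seed)
    dish_tables = []  # dish_tables[k] = tables across the franchise serving dish k
    z = []
    for _ in range(n):
        table_sizes, table_dish, z_i = [], [], []
        for _ in range(J):
            # Join table t with weight equal to its size, or open a new one with weight alpha.
            t = rng.choices(range(len(table_sizes) + 1), weights=table_sizes + [alpha])[0]
            if t == len(table_sizes):
                # New table: pick a dish by franchise-wide table counts, or a new dish with weight alpha0.
                k = rng.choices(range(len(dish_tables) + 1), weights=dish_tables + [alpha0])[0]
                if k == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            z_i.append(table_dish[t])
        z.append(z_i)
    return z

z = simulate_crf(n=20, J=16, alpha=1.0, alpha0=1.0)
population_size = len({k for z_i in z for k in z_i})   # population mixture size
individual_sizes = [len(set(z_i)) for z_i in z]        # individual mixture sizes
```

Repeating the call over a grid of `alpha`, `alpha0`, and `n` with different seeds gives Monte Carlo estimates of the mean and standard error of the mixture size.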

5.6.2 Size of Individual-Level Mixtures 

Figures 5.3 and 5.4 show that the effect of the HDPM parameters on the size of the individual mixtures
is similar to their effect on the population mixture size. As seen in Figure 5.3, the average number
of classes in each individual mixture increases with both $\alpha$ and $\alpha_0$. The first third of the chart comes
from simulations in which each individual responds to 16 survey items. For the second third, there
are 64 items per individual, and there are 100 items per individual in the last third. There is a clear
interaction between the precision parameters and the number of survey items. The size of individual
mixtures increases with the number of responses per individual. In terms of the Chinese restaurant
franchise, more tables are needed as more customers enter each restaurant. Interestingly, this effect is
influenced by the two precision parameters. The mixture size increases more dramatically for large
values of $\alpha$ and $\alpha_0$. Figure 5.4 shows how the standard deviation among the size of the individual
mixtures is affected by the number of survey items and the precision parameters. The effect on
variability is similar to, but weaker than, the effect on average mixture size. In most cases, the
variability in the size of individual mixtures is quite small compared to the average. Note that this
analysis compares only the number of classes represented by the individuals. This does not take into
account the number of shared dishes between restaurants nor the variability in their class proportions
(the $\pi_i$s).

5.6.3 Similarity among Individual-Level Mixtures 

In the HDPM model, each individual is associated with a unique mixture of the population classes.
The similarity among the individuals is influenced by both $\alpha_0$ and $\alpha$, though in opposite manners.
Large values of $\alpha$ lead to individual mixtures being closer to the population, and hence, each other.
Therefore, similarity among the individuals tends to increase with $\alpha$. On the other hand, similarity
tends to decrease as $\alpha_0$ increases. Intuitively, the number of classes in the entire sample tends to be
small when $\alpha_0$ is small. Hence, individual mixtures select from a small pool of potential classes.
This leads to high similarity between individuals. When $\alpha_0$ is large, individual mixtures select from
a large pool of potential classes and tend to be less similar. Indeed, as $\alpha_0$ approaches infinity, the
class parameters tend to behave as i.i.d. draws from the base measure $H$. If $H$ is non-atomic, then
the individual mixtures will not have any components in common. In the other extreme, as $\alpha_0$
approaches zero, the number of classes in the sample tends to 1. This results in every individual
“mixture” being exactly the same, with 100% of the weight on the sole class.



[Figure graphic not reproduced; horizontal axis: Number of Individuals.]

FIGURE 5.1

The effect of prior precisions and the number of individuals on the expected population mixture
size in HDPM models. The population mixture size is the number of classes represented across all
individuals in response to J = 16 measurements. Error bars represent two standard errors. Estimates
are based on 100 simulations of each model.

FIGURE 5.2

Standard errors for the population mixture size for various sample sizes and precisions in the HDPM
model, based on 100 simulations per model.

[Figure graphic not reproduced; axis: J = 16, 64, 100.]

FIGURE 5.3

The effect of prior precisions and number of measurements per individual on the average
individual-level mixture size. The mixture size represents the number of classes an individual represented
during J measurements. Based on 100 simulations per model.

FIGURE 5.4

Standard deviation in individual-level mixture size for various HDPM models. Mixture size
measures the number of classes an individual represented during J measurements. Based on 100
simulations per model.

The effect of the two precision parameters on similarity is shown in Figure 5.5 based on 100
simulations of n = 50 individuals responding to J = 16 items. There are several reasonable choices 
for measuring similarity. Here, the similarity between two individuals is defined by 


$$
\mathrm{Sim}(i, i') = \frac{\sum_{k=1}^{K} \min(n_{ik}, n_{i'k})}{J}, \qquad (5.13)
$$

where $J$ is the number of items per individual (16 in this case), $K$ is the total number of classes in
the sample, and $n_{ik}$ is the number of times individual $i$ responded to a survey item as a member of
class $k$. In effect, $\mathrm{Sim}(i, i')$ counts the number of times both individuals represented the same class,
after arranging the second individual’s responses to maximize the overlap with the first individual.
Rearrangement is valid because responses from each individual are exchangeable.
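
Equation (5.13) is straightforward to compute from two vectors of class labels. The sketch below is an illustration; the function name `similarity` is hypothetical, and both individuals are assumed to have the same number of items $J$.

```python
from collections import Counter

def similarity(z_i, z_j):
    """Sim(i, i') of Equation (5.13): shared class counts divided by the number of items J."""
    if len(z_i) != len(z_j):
        raise ValueError("both individuals must respond to the same number of items")
    n_i, n_j = Counter(z_i), Counter(z_j)
    # Counter returns 0 for classes that the second individual never represented.
    overlap = sum(min(count, n_j[k]) for k, count in n_i.items())
    return overlap / len(z_i)

sim = similarity([0, 0, 1, 2], [0, 1, 1, 3])
```

Here the two individuals share one response in class 0 and one in class 1, so two of the four items overlap.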

Note that none of the heatmaps in Figure 5.5 exhibit any strong structure. This is expected under
the HDPM model since the individuals are conditionally independent given the population-level
mixture. On the other hand, one can easily see that the similarity among individuals in a particular
model increases as $\alpha$ increases and as $\alpha_0$ decreases. This exactly matches the intuition explained
above.



FIGURE 5.5 

The effect of HDPM precision parameters on the similarity of individual-level mixtures. Darker 
areas correspond to higher similarity. Similarity is averaged across 100 simulations. 


5.7 Inference Strategies for HDP Mixtures 

Two broad categories of inference strategies for nonparametric mixture models are Markov chain
Monte Carlo (MCMC) sampling and variational inference. Sampling techniques have the
advantage of converging to the correct answer, at least under certain circumstances. Unfortunately, these
techniques often require a great deal of computation time and it can be very difficult to assess
convergence. Convergence for variational inference can be achieved quickly and assessed easily, but at
the cost of some bias. Some simulation experiments have shown that the bias is not too drastic, at
least when $H$ is in the exponential family (Blei and Jordan, 2006).

5.7.1 Markov Chain Monte Carlo Techniques 

Escobar and West (1995) demonstrate a Gibbs sampling scheme to estimate the posterior
distribution for the DPM model, including inference for the precision term $\alpha$. They directly sample from
$f(\theta_{z_n} \mid \theta_{z}^{-}, \alpha)$, where $\theta_{z}^{-}$ denotes the class parameters for all observations except the $n$th one.
Unfortunately, Markov chains built on this representation are slow to converge. In order for a class
parameter to change, each member of that class must move to a new or different class one at a time.
Thus, in order to remove a class or create a new one, there are low-probability intermediate states in
which observations are in their own class. A more efficient strategy is to represent $\theta_{z_n}$ as the class
parameters ($\theta_k$) and class memberships ($z_i$) (MacEachern, 1994). This strategy is sometimes called
the “collapsed” Gibbs sampler. Class assignments can be updated by combining the prior
probabilities from the Chinese restaurant process with the likelihood of $x_i$ given the class parameters. Let $K$
denote the current number of classes in the mixture. For $k \le K + 1$, let $f_{\mathrm{CRP}}(k)$ be the probability
that $z_n = k$ conditioned on the rest of the class memberships under the Chinese restaurant process
(Equation 5.8). The probability that observation $n$ should be assigned to class $k \le K$ is

$$
P(z_n = k \mid \alpha, z_n^{-}) \propto f_{\mathrm{CRP}}(k) f(x_n \mid \theta_k), \qquad (5.14)
$$

where $z_n^{-}$ denotes all class assignments except for $z_n$. The probability that $x_n$ should be assigned to
a new class is

$$
P(z_n = K + 1 \mid \alpha, z_n^{-}) \propto f_{\mathrm{CRP}}(K + 1) \int_{\Theta} f(x_n \mid \theta) \, dH(\theta). \qquad (5.15)
$$

Since the observations are exchangeable, these equations can be used for any $z_i$ by treating $x_i$ as
the last observation. Once the class membership vectors are updated, the class parameter $\theta_k$ can
be updated from the posterior distribution given the prior $H$ and the set of observations currently
assigned to class $k$, denoted by $A_k = \{i : z_i = k\}$:

$$
f(\theta_k \mid \{x_i : i \in A_k\}) \propto \prod_{i \in A_k} f(x_i \mid \theta_k) \, dH(\theta_k). \qquad (5.16)
$$

In cases where two or more classes share similar structure, Jain and Neal (2004) propose a
“split-merge algorithm” that allows larger jumps in MCMC updates. This algorithm uses a
Metropolis-Hastings step to potentially split one class into two or merge two classes into one. For the DPM
model, MCMC sampling is fairly straightforward if the base measure $H(\theta)$ is conjugate to $F(\cdot \mid \theta)$.
This conjugacy is important for two reasons. First, the probability of moving $x_i$ to a new class
depends on the integral $\int_{\Theta} f(x_i \mid \theta) \, dH(\theta)$. Second, in the collapsed Gibbs sampler, conjugacy leads to
simple updates of $\theta_k$ given the observations in class $k$. Strategies for non-conjugate $H$ include the
“no gaps” algorithm, which augments the latent class representation with empty classes
(MacEachern and Müller, 1998), and a split-merge algorithm for non-conjugate base measures (Jain and Neal,
2007).
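
The updates in Equations (5.14) through (5.16) can be sketched for one conjugate special case. Everything specific here is an assumption made for illustration: scalar observations, $F(\cdot \mid \theta) = N(\theta, \sigma^2)$ with known variance, $H = N(\mu_0, \tau^2)$, Chinese restaurant process weights proportional to class size (existing class) and $\alpha$ (new class), and the hypothetical function names `normal_pdf`, `draw_theta`, and `gibbs_sweep`.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def draw_theta(rng, members, sigma2, mu0, tau2):
    """Sample theta_k from its normal posterior given the observations in class k (Eq. 5.16)."""
    post_var = 1.0 / (1.0 / tau2 + len(members) / sigma2)
    post_mean = post_var * (mu0 / tau2 + sum(members) / sigma2)
    return rng.gauss(post_mean, math.sqrt(post_var))

def gibbs_sweep(x, z, theta, alpha, sigma2, mu0, tau2, rng):
    """One sweep over class assignments, then class parameters. theta maps label -> mean."""
    for n in range(len(x)):
        old, z[n] = z[n], None                       # treat x[n] as the last observation
        counts = {}
        for k in z:
            if k is not None:
                counts[k] = counts.get(k, 0) + 1
        if old not in counts:
            del theta[old]                           # the class became empty
        labels = list(counts)
        # Eq. (5.14): CRP weight (class size) times likelihood under theta_k.
        weights = [counts[k] * normal_pdf(x[n], theta[k], sigma2) for k in labels]
        # Eq. (5.15): CRP weight alpha times the marginal N(mu0, sigma2 + tau2).
        weights.append(alpha * normal_pdf(x[n], mu0, sigma2 + tau2))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(labels):
            new = max(theta, default=-1) + 1
            theta[new] = draw_theta(rng, [x[n]], sigma2, mu0, tau2)
            z[n] = new
        else:
            z[n] = labels[choice]
    for k in theta:                                  # Eq. (5.16) for every occupied class
        theta[k] = draw_theta(rng, [x[i] for i in range(len(x)) if z[i] == k],
                              sigma2, mu0, tau2)
    return z, theta

rng = random.Random(1)
x = [-5.1, -4.9, -5.0, 5.0, 5.2, 4.8]
z, theta = [0] * len(x), {0: 0.0}
for _ in range(20):
    z, theta = gibbs_sweep(x, z, theta, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=25.0, rng=rng)
```

The normalizing constant of the Chinese restaurant process is omitted because it is common to all classes. A non-conjugate $H$ would require one of the alternative strategies cited above in place of the closed-form marginal.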

For the HDPM model, Gibbs sampling is more complex due to the larger amount of
bookkeeping required. In order to update $z_{ij}$, it is necessary to keep track of how many tables have dish $k$,
how many customers are at each of those tables, and which restaurant the tables are in. This can
lead to heavy memory requirements in large datasets. Blunsom et al. (2009) propose a more
efficient representation based on the idea of histograms. For each dish $k$ and each positive integer $m$,
they simply maintain a count of how many tables with $m$ customers are serving dish $k$. This
representation takes advantage of the exchangeability properties of the HDPM model. Because
responses are independent given the latent classes, it does not matter which table a customer
actually sits at. When a customer joins a table, the appropriate bin count is decremented and the
bin above is incremented. For example, if a customer is assigned to table 9, which has two previous
customers, then there are now three customers at table 9. Thus, there is one fewer table with two
customers and one more table with three customers. When a customer leaves a table, the opposite
happens. The appropriate bin is decremented and the bin below is incremented.
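
The histogram idea can be sketched with a small data structure. This is a hedged illustration of the bookkeeping described above, not the implementation of Blunsom et al. (2009); the class name `TableHistogram` and its methods are hypothetical, and one such object would presumably be kept per restaurant.

```python
from collections import defaultdict

class TableHistogram:
    """hist[k][m] = number of tables serving dish k that seat exactly m customers."""

    def __init__(self):
        self.hist = defaultdict(dict)

    def _decrement(self, k, m):
        if self.hist[k].get(m, 0) <= 0:
            raise ValueError("no table with %d customers serves dish %r" % (m, k))
        self.hist[k][m] -= 1
        if self.hist[k][m] == 0:
            del self.hist[k][m]

    def add_customer(self, k, m):
        """Seat a customer at a table with m current customers (m = 0 opens a new table)."""
        if m > 0:
            self._decrement(k, m)
        self.hist[k][m + 1] = self.hist[k].get(m + 1, 0) + 1

    def remove_customer(self, k, m):
        """Remove a customer from a table with m customers (m = 1 closes the table)."""
        self._decrement(k, m)
        if m > 1:
            self.hist[k][m - 1] = self.hist[k].get(m - 1, 0) + 1

    def tables(self, k):
        return sum(self.hist[k].values())

    def customers(self, k):
        return sum(m * count for m, count in self.hist[k].items())

h = TableHistogram()
h.add_customer(0, 0)   # open a table for dish 0
h.add_customer(0, 1)   # second customer joins it
h.add_customer(0, 2)   # third customer joins it
h.add_customer(0, 0)   # a second table opens for dish 0
```

After these calls the histogram for dish 0 holds one table with three customers and one table with one customer, without recording which table is which.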

Once the mechanism for implementing the Chinese restaurant franchise is decided, MCMC
sampling can proceed as in the DPM model. That is, the latent class assignments ($z_{ij}$) and class
parameters ($\theta_k$) can be alternately updated. Let $K$ be the current number of classes. For $k \le K + 1$,
let $f_{\mathrm{CRF}}(k)$ be the probability that $z_{ij} = k$ given the rest of the class assignments under the Chinese
restaurant franchise (Equation 5.11). The probability that observation $x_{ij}$ should be assigned to class
$k \le K$ is

$$
P(z_{ij} = k \mid \alpha, \alpha_0, z_{ij}^{-}) \propto f_{\mathrm{CRF}}(k) f(x_{ij} \mid \theta_k), \qquad (5.17)
$$

where $z_{ij}^{-}$ denotes all class assignments except for $z_{ij}$. The probability that $x_{ij}$ should be assigned
to a new class is

$$
P(z_{ij} = K + 1 \mid \alpha, \alpha_0, z_{ij}^{-}) \propto f_{\mathrm{CRF}}(K + 1) \int_{\Theta} f(x_{ij} \mid \theta) \, dH(\theta). \qquad (5.18)
$$

Note that the updates are the same as in the DPM model, except that $f_{\mathrm{CRP}}(k)$ is replaced by $f_{\mathrm{CRF}}(k)$.

Since the observations are independent given the latent class assignments, these equations can
be used for any $z_{ij}$ by treating $x_{ij}$ as the last observation. Once the class assignments are updated,
the class parameter $\theta_k$ can be updated in the same way as in the DPM model. Namely, the new
value of $\theta_k$ is randomly generated from its posterior distribution given the prior $H$ and the set of
observations assigned to class $k$ as in Equation (5.16).

5.7.2 Variational Inference 

Variational inference can be viewed as an extension of the expectation maximization algorithm 
(EM) (Beal, 2003). Whereas EM uses an iterative approach to find a point estimate for some vector 
of unobserved variables (e.g., latent variables and parameters), variational inference attempts to 
approximate their entire posterior distribution. 

Let $\theta$ and $z$ be the sets of model parameters and latent variables. In DPM and HDPM models,
direct calculation of $f(\theta, z \mid x)$ is impractical due to the intractable calculation of the data marginal.
The intractability arises from the complex interactions among parameters and latent variables. The
variational approach is to constrain the posterior to some simpler family of variational functions that
treat these values as independent. The posterior is approximated by finding the variational function
closest to the true posterior (e.g., in KL divergence). Because the variational functions break the
dependence between some variables, it is possible to minimize the divergence by iteratively
optimizing one piece of the function at a time, given the rest of the function. For example, one may
constrain $f(\theta, z \mid x)$ to be of the form $q_\theta(\theta \mid x) \cdot q_z(z \mid x)$. This can be optimized using coordinate
ascent by iteratively updating $q_\theta$ and $q_z$ based on the value of the other function.
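
The coordinate ascent scheme can be made concrete with the standard mean-field identities. The display below is a sketch under the assumption that closeness is measured by $\mathrm{KL}(q \,\|\, f(\theta, z \mid x))$; the symbol $\mathcal{L}(q)$ for the lower bound is introduced here for illustration and does not appear in the text above.

$$
\begin{aligned}
\log f(x) &= \underbrace{E_q[\log f(x, \theta, z)] - E_q[\log q(\theta, z)]}_{\mathcal{L}(q)} + \mathrm{KL}\big(q(\theta, z) \,\|\, f(\theta, z \mid x)\big), \\
q_\theta(\theta) &\propto \exp\big\{ E_{q_z}[\log f(x, \theta, z)] \big\}, \\
q_z(z) &\propto \exp\big\{ E_{q_\theta}[\log f(x, \theta, z)] \big\}.
\end{aligned}
$$

Because $\log f(x)$ does not depend on $q$, increasing $\mathcal{L}(q)$ is equivalent to decreasing the KL divergence, and each of the two updates increases $\mathcal{L}(q)$ with the other factor held fixed.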

Blei and Jordan (2006) provides an explicit algorithm for DP mixtures when the base measure 
H is exponential family . Teh et al. (2008) describes a variational approach for hierarchical models 
that can be used for mixed membership models. The latent variables in the DP mixture are the 
class proportions (n k ), class parameters (0 k ), and class assignments (z k ). Rather than work with 
the class proportions, Blei and Jordan work directly with ( <p k ), the beta random variables from 
the stick-breaking process. In order to update the variational functions, they also limit the number 
of components in the variational function to a finite number, say T. However, they optimize the 
KL-divergence between this truncated stick-breaking measure and the full DP posterior with infinite 
components. This yields a set of variational functions parametrized by: 
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T—l T n 

q{<t>,6,z) = n )> (5- 19 ) 

t— 1 t= 1 i=l 

where each $q_{\gamma_t}$ is a beta distribution, each $q_{\tau_t}$ is in the same family as the prior $H$, and each $q_{\rho_i}(z_i)$ is
multinomial. Notice that the variational function for each variable is in the same family as its marginal
under the true posterior; however, the variational function treats all variables as independent.
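
As a small illustration of how the truncated family is used, the sketch below computes the expected class proportions implied by the beta factors of the variational function. It assumes each factor is summarized by a pair of beta parameters and that the last stick is fixed at one, as in a truncation at $T$ components. The function name and argument layout are hypothetical, not taken from Blei and Jordan (2006).

```python
def expected_stick_weights(beta_params):
    """Expected class proportions under a truncated, fully factorized q.

    beta_params: list of (a_t, b_t) pairs, the beta parameters of the factors
    for phi_1, ..., phi_{T-1}. The T-th stick is fixed at 1 (truncation).
    Returns a list of T expected proportions.
    """
    weights = []
    remaining = 1.0  # E[prod_{r<t} (1 - phi_r)]; factorizes because q is independent
    for a, b in beta_params:
        mean_phi = a / (a + b)           # mean of a Beta(a, b) variable
        weights.append(mean_phi * remaining)
        remaining *= 1.0 - mean_phi
    weights.append(remaining)            # last component absorbs the leftover stick
    return weights
```

For example, `expected_stick_weights([(1.0, 1.0), (1.0, 3.0)])` returns three proportions that sum to one. The product of means is exact here only because the variational function treats the sticks as independent.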

The variational function updates proceed like posterior updates given the data and the current 
value of the other functions. Blei and Jordan (2006) provide explicit updates for each function in
the case where $H$ belongs to an exponential family. They compared this variational algorithm to the collapsed
Gibbs sampler and found that the log-probability of held-out data was similar, but that the varia- 
tional approach required less computation time. Furthermore, the computation time for variational 
inference did not increase dramatically in the range of 5- to 40-dimensional observations. 

Variational inference is even more efficient if some dimensions of the parameter space can be 
integrated out. For example, if inferential goals do not include recovering the full mixture posterior, 
it is possible to integrate out the mixing proportions (Kurihara et al., 2007). This still allows pos- 
terior analysis of class membership and parameters as well as calculation of a lower bound for the 
data marginal. Teh et al. (2008) extends this collapsed algorithm to hierarchical Dirichlet process 
mixtures. 

One of the advantages of nonparametric models is that they allow the complexity of the model 
to grow as new data are observed. This property may be especially advantageous for streaming 
applications, for which new data continually arrive. Online variational inference algorithms have 
been developed for mixed membership models (Canini et al., 2009; Hoffman et al., 2010; Rodriguez,
2011) including the HDPM model (Wang et al., 2011).

5.7.3 Hyperparameters 

The parameters for the DPM and HDPM models include the precisions for Dirichlet processes at 
each level of mixing and possible hyperparameters for the base distribution H. For example, if H 
is a normal distribution, a hyperprior may be used to learn about its mean and variance. Typically, 
hyperpriors are at least used for the precision parameters $\alpha_0$ and $\alpha$, since inference can be sensitive
to these choices. For example, in one of the first practical applications of the DPM model, Escobar
and West (1995) show that the posterior distribution over the number of classes $K$ is quite sensitive
to the precision, although the predictive distribution is robust. To decrease sensitivity, they recommend
using diffuse gamma hyperpriors for precision parameters. Gamma hyperpriors are convenient because the
induced posterior for $\alpha$, given the data and latent variables, depends on them only through the number
of classes (and the sample size). Thus, the value of $\alpha$ can be
updated efficiently based on the current value of the other latent and observed variables.
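
One way such an update can be carried out is the auxiliary-variable scheme associated with Escobar and West (1995). The sketch below is a minimal version under stated assumptions: the hyperprior is Gamma with the given `shape` and `rate`, and the function name and defaults are hypothetical choices made for illustration.

```python
import math
import random


def resample_precision(alpha, n_obs, n_classes, shape=1.0, rate=1.0, rng=random):
    """One Gibbs update of a DP precision under a Gamma(shape, rate) hyperprior.

    alpha: current precision; n_obs: number of observations;
    n_classes: current number of occupied classes.
    """
    # Auxiliary variable: eta | alpha ~ Beta(alpha + 1, n_obs).
    eta = rng.betavariate(alpha + 1.0, n_obs)
    rate_post = rate - math.log(eta)
    # Mixture of two gammas; the odds of the first component are
    # (shape + K - 1) / (n * (rate - log eta)).
    odds = (shape + n_classes - 1.0) / (n_obs * rate_post)
    if rng.random() < odds / (1.0 + odds):
        shape_post = shape + n_classes
    else:
        shape_post = shape + n_classes - 1.0
    # random.gammavariate takes (shape, scale), so pass 1 / rate.
    return rng.gammavariate(shape_post, 1.0 / rate_post)
```

Calling the function repeatedly inside a sampler, with `n_classes` refreshed after each sweep over the class assignments, gives draws of the precision. Only `n_obs` and `n_classes` enter the update, which is the convenience noted above.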


5.8 Example Applications of Hierarchical DPs 

Erosheva et al. (2007) applies the GoM model to data from the National Long Term Care Survey. 
Alternatively, the HDPM model provides a nonparametric approach to the same data. For each indi- 
vidual, the survey contains binary outcomes on 6 “Activities of Daily Living” (ADL) and 10 “Instru- 
mental Activities of Daily Living” (IADL). ADL items include basic activities required for personal 
care, such as eating, dressing, and bathing. IADL items include basic activities necessary to reside in 
the community such as doing laundry, cooking, and managing money. Positive responses (disabled) 
to each item signify that during the past week the activity was not completed or not expected to be 
completed without the assistance of another person or equipment. Each survey response is regarded 
as an independent Bernoulli random variable: $X_{ij} \sim \mathrm{Bern}(\theta_{ij})$, where $X_{ij}$ is the response of the
$i$th individual on the $j$th item and $\theta_{ij}$ is the probability of a positive response. In this context, a
mixture model asserts that the population consists of various sub-groups with varying probabilities 
of a positive (disabled) response. For example, the population may contain healthy, mildly disabled, 
and disabled cohorts with increasing probabilities of positive responses. 

The GoM model combines mixture models for both the population and individual level. The 
individual-level mixture asserts, for example, that an individual may behave as a member of the 
healthy cohort in response to item 1, but behave as a member of the disabled cohort in response 
to item 2. Each individual is associated with unique mixture probabilities. The population mixture 
defines the overall proportions of the various cohorts across all individuals and items. 

Replacing the GoM model with the HDPM model yields a similar structure, except that the 
number of classes does not need to be specified a priori. 
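
To make the two-level structure concrete, the following minimal sketch simulates binary survey responses from a finite mixed membership model of the kind described above. The symmetric Dirichlet prior on individual memberships, the class-specific item probabilities in `item_probs`, and all function names are illustrative assumptions, not the specification used by Erosheva et al. (2007).

```python
import random


def dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [max(rng.gammavariate(max(p, 1e-12), 1.0), 1e-300) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_gom_responses(n_people, item_probs, alpha, rng):
    """Simulate 0/1 responses; item_probs[k][j] = P(positive | class k, item j)."""
    n_classes = len(item_probs)
    n_items = len(item_probs[0])
    data = []
    for _ in range(n_people):
        # Individual-level mixture: this person's membership in each class.
        membership = dirichlet([alpha / n_classes] * n_classes, rng)
        row = []
        for j in range(n_items):
            # The person may act as a member of a different class on each item.
            k = rng.choices(range(n_classes), weights=membership)[0]
            row.append(1 if rng.random() < item_probs[k][j] else 0)
        data.append(row)
    return data
```

With two classes, say a "healthy" row of small probabilities and a "disabled" row of large ones, each simulated individual mixes the two profiles item by item. In this finite sketch the number of classes is fixed by `item_probs`, which is the choice that the HDPM model avoids.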

Blei et al. (2003) presents a mixed membership model for modeling documents called latent 
Dirichlet allocation (LDA). The classes are various topics (e.g., computer science, operating sys- 
tems, and machine learning). Each topic is considered a multinomial distribution over some finite 
vocabulary. The class parameters are the multinomial proportions, which are smoothed using a 
Dirichlet prior. Each document in the sample is associated with a unique mixture of topic propor- 
tions. A word is generated by selecting a topic from the document-level mixture, then choosing a 
word from the topic-specific multinomial. As with the Grade of Membership model, the number of 
classes (topics) must be specified a priori. Alternatively, one can use a hierarchical Dirichlet process 
mixture, in which the number of potential topics is countably infinite (Teh et al., 2006). Under this 
nonparametric mixed membership model, each new word has a positive probability of belonging to 
a new topic. Hoffman et al. (2008) uses a similar model to measure musical similarity, where the 
documents are musical pieces and the “topics” are features. 

5.8.1 The Infinite Hidden Markov Model 

In the hidden Markov model, a sequence of observations $(x_1, x_2, \ldots, x_n)$ is explained by a second
sequence of latent variables $(y_1, y_2, \ldots, y_n)$. The latent sequence is modeled by a Markov chain and
the observation (or emission) at time $t$ is assumed to depend only on the state of the chain at time
$t$. Hidden Markov models assume fixed finite numbers for both the number of latent states and the
number of possible emissions. Each state $s$ is associated with a vector of transition probabilities,
$\pi_s^T = (\pi_{s1}^T, \ldots, \pi_{sK}^T)$, where $\pi_{sk}^T = P(y_{t+1} = k \mid y_t = s)$; and a vector of emission probabilities,
$\pi_s^E = (\pi_{s1}^E, \ldots, \pi_{sV}^E)$, where $\pi_{sv}^E = P(x_t = v \mid y_t = s)$.

A hidden Markov model can be specified as a mixed membership model by taking the latent
states as the possible classes. The vectors $\pi_s^T$ and $\pi_s^E$ define mixtures over the state space and
emission space; since $y_{t+1}$ and $x_t$ are conditionally independent given $y_t$, one may consider each
mixture separately. Denoting the number of possible states by $K$, a mixed membership model for
the transitions can be defined by Dirichlet priors:

\[
\begin{aligned}
\pi_0^T &\sim \mathrm{Dir}(\alpha_0^T / K), \\
\pi_s^T &\sim \mathrm{Dir}(\alpha^T \cdot \pi_0^T), && s = 1, \ldots, K, \\
y_{t+1} \mid y_t &\sim \mathrm{Mult}(\pi_{y_t}^T), && t = 1, \ldots, n.
\end{aligned}
\]


The state-dependent vectors, $\pi_s^T$, allow each state to have unique transition probabilities, which are
shrunk toward the population-level weights, $\pi_0^T$. Separate Dirichlet priors can be used to define a
mixed membership model for emissions, with $V$ denoting the number of possible values:

\[
\begin{aligned}
\pi_0^E &\sim \mathrm{Dir}(\alpha_0^E / V), \\
\pi_s^E &\sim \mathrm{Dir}(\alpha^E \cdot \pi_0^E), && s = 1, \ldots, K, \\
x_t \mid y_t &\sim \mathrm{Mult}(\pi_{y_t}^E), && t = 1, \ldots, n.
\end{aligned}
\]


As with the transition vectors, each state has unique emission probabilities, which are shrunk to- 
ward the population averages. Beal et al. (2001) developed a nonparametric version of HMMs by 
replacing the Dirichlet priors with Dirichlet process priors. As there is a countably infinite number
of potential states and emissions, they call this model the infinite hidden Markov model (iHMM).
The authors apply this model to a language processing problem. The latent state $y_t$ denotes a topic
which specifies a multinomial distribution for the $t$th word. Because both the transition and
emission models are nonparametric, there is a non-zero probability that the Markov chain transitions to
a new topic, or that a topic produces a previously unobserved word. 
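
The finite, Dirichlet-based transition model described earlier in this subsection can be simulated in a few lines. This is a minimal sketch: the function names are hypothetical, the initial state is drawn from the population-level weights (an assumption, since the text does not specify an initial distribution), and emissions are omitted because they follow the same pattern with $V$ in place of $K$.

```python
import random


def dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [max(rng.gammavariate(max(p, 1e-12), 1.0), 1e-300) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def sample_hmm_states(n_steps, n_states, alpha0, alpha, rng):
    """Sample transition vectors and a latent state path for a finite HMM."""
    # Population-level weights shared by all states.
    pi0 = dirichlet([alpha0 / n_states] * n_states, rng)
    # State-specific transition vectors, shrunk toward pi0.
    trans = [dirichlet([alpha * w for w in pi0], rng) for _ in range(n_states)]
    # Assumption: the chain starts from a draw of the population-level weights.
    state = rng.choices(range(n_states), weights=pi0)[0]
    path = [state]
    for _ in range(n_steps - 1):
        state = rng.choices(range(n_states), weights=trans[state])[0]
        path.append(state)
    return pi0, trans, path
```

Larger values of `alpha` pull every row of `trans` closer to `pi0`, while small values let each state develop its own transition pattern. The iHMM replaces these finite Dirichlet draws with Dirichlet process draws, so the sketch only illustrates the parametric starting point.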


5.9 Other Nonparametric Mixed Membership Models 
5.9.1 Multiple-Level Hierarchies 

The four main models in this chapter (DM in Section 5.2, DPM in Section 5.3, GoM in Section 5.4,
and HDPM in Section 5.5) all produce exchangeability within any given mixture. The individuals
($x_i$) are exchangeable in all models and the per-item responses ($x_{ij}$) are also exchangeable in
the GoM and HDPM models. If this exchangeability structure is unrealistic or undesired, one way 
to introduce dependence is to include multiple levels of hierarchy. For example, in the National 
Long Term Care Survey, some responses concern “Activities of Daily Living” and others concern 
“Instrumental Activities of Daily Living.” In theory, an individual’s class membership probabilities 
could vary depending on the sub-category. This can be modeled by including an extra layer of 
Dirichlet process mixing with S denoting the number of sub-categories: 


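\[
\begin{aligned}
P_0 &\sim \mathrm{DP}(\alpha_0, H), \\
P_i \mid P_0 &\sim \mathrm{DP}(\alpha_1, P_0), && i = 1, \ldots, n, \\
P_{is} \mid P_i &\sim \mathrm{DP}(\alpha_2, P_i), && s = 1, \ldots, S, \\
\theta_{isj} \mid P_{is} &\sim P_{is}, && i = 1, \ldots, n, \; s = 1, \ldots, S, \; j = 1, \ldots, J, \\
X_{isj} \mid \theta_{isj} &\sim p(\theta_{isj}), && i = 1, \ldots, n, \; s = 1, \ldots, S, \; j = 1, \ldots, J.
\end{aligned}
\]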


This model includes mixtures at the population level ($P_0$), at the individual level ($P_i$), and at the
sub-category level for each individual ($P_{is}$). The degree to which any two responses share
information is determined by how many hierarchy levels separate them (as well as the relevant precision
parameters). As before, responses are fully exchangeable within each mixture.

A double hierarchy may also be appropriate if individuals come from multiple sub-populations. 
For example, one could divide individuals based on type of residence: apartment, house, or nursing 
home. In this case, the HDPM model would include mixtures at the population level, sub-population 
level (type of residence), and individual level. Furthermore, if items are divided into various cate- 
gories, then it is possible to include a fourth level of the hierarchy to account for this. In theory, any 
number of levels is possible, although the number of latent variables needed to represent a mixture 
model grows with each new level. Teh et al. (2006) provides an example application of a three-level
HDPM model that they use to analyze articles from the proceedings of the Neural Information Pro- 
cessing Systems (NIPS) conference from the years 1988 to 1999. The articles are divided into nine 
sections, such as “algorithms and architectures” and “applications.” Documents from the same sec- 
tion are expected to have a similar distribution of topics. Their model incorporates topic mixtures at 
the document level, section level, and population level. Here, the population is the entire collection 
of documents. The section level mixture allows a document to share information about topics more 
closely with documents within the same section than with documents in other sections. 

5.9.2 Dependent Dirichlet Processes 

In some cases, the problem is not to create a desired exchangeability structure but to induce corre- 
lation between different mixture components. This may be the case when covariates are measured 
for each observation. Exchangeability implies that there is no a priori difference among the possible 
covariate values. This may be appropriate for nominal variables such as gender or ethnicity. In this 
case, the effects of the covariate can be accounted for using additional hierarchy levels as described 
above. On the other hand, exchangeability may not be appropriate for ordinal or continuous vari- 
ables, such as years of experience or age. Dependent Dirichlet processes have been developed for 
these types of covariates. For example, spatial Dirichlet processes have been used when the “indi- 
viduals” are points in space (Gelfand et al., 2005; Duan et al., 2007). Such models produce Dirichlet 
process mixtures at each point, such that the mixtures are more similar when points are closer to- 
gether. Temporal versions of dependent Dirichlet processes have also been developed which allow 
a nonparametric mixture to evolve over time (Xu et al., 2008; Ahmed and Xing, 2008). Although 
applications of dependent Dirichlet processes have focused on extensions to the DPM model, they 
provide potential sources for new nonparametric mixture models when hierarchical versions are 
developed. 

Exchangeability may also be undesirable if one believes that certain classes tend to co-occur 
more often than other classes. Teh et al. (2006) uses HDPM models to describe documents (the 
observations) as a mixture of various topics (the classes). In sufficiently broad collections of docu- 
ments, one may find that certain topics often appear together. For example, a document that focuses 
on the topic “politics” may be more likely to include the topic “economics” and less likely to include 
the topic “baseball.” In other words, the occurrence of politics and economics may be positively cor- 
related whereas the occurrence of politics and baseball may be negatively correlated. Unfortunately, 
the exchangeability property of the HDPM model prevents it from explicitly describing this corre- 
lation. Paisley et al. (2012) replaces the hierarchical Dirichlet process with a prior that they call the 
discrete infinite logistic normal distribution. This prior produces a mixed membership model that 
is able to explicitly describe correlated topics. Paisley et al. uses this prior to model a collection of 
10,000 documents from Wikipedia. 

5.9.3 Pitman-Yor Processes

The Pitman-Yor process (Pitman and Yor, 1997), or two-parameter Poisson-Dirichlet process,
provides more flexibility in the clustering behavior of Dirichlet process mixture models. In addition to
the base measure ($H$) and precision ($\alpha$), there is a discount parameter, $0 \le d < 1$. The Pitman-Yor
process allows negative values for $\alpha$ provided that $\alpha > -d$.

The Pitman-Yor process can be illustrated using a more general version of the Chinese
restaurant process. Consider a hierarchical model with $\theta_1, \theta_2, \ldots$ being a sequence of i.i.d. random
variables with random distribution $P$, where $P$ has a Pitman-Yor process prior. Similar to the Chinese
restaurant process, when a customer arrives he either joins an existing table or begins a new table.
Let $K$ be the current number of occupied tables, $n_k$ the number of customers already seated at table $k$,
$z_i$ the dish for the $i$th customer, and $z_{-n}$ the vector of dishes except for $z_n$:

\[
P(z_n = k \mid \alpha, d, z_{-n}) =
\begin{cases}
\dfrac{n_k - d}{n - 1 + \alpha}, & k \le K, \\[2ex]
\dfrac{\alpha + Kd}{n - 1 + \alpha}, & k = K + 1.
\end{cases}
\tag{5.20}
\]
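
The seating rule of this generalized Chinese restaurant process is easy to simulate: an occupied table attracts a new customer with weight equal to its current count minus the discount, and a new table has weight equal to the precision plus the discount times the number of occupied tables. The sketch below is a minimal illustration with hypothetical function and variable names.

```python
import random


def pitman_yor_crp(n_customers, alpha, d, rng):
    """Seat customers one at a time; return (table assignments, table counts)."""
    counts = []        # counts[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        if not counts:
            k = 0      # the first customer always opens a new table
        else:
            # Unnormalized weights; the common denominator cancels.
            weights = [c - d for c in counts] + [alpha + len(counts) * d]
            k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

Setting `d = 0` recovers the ordinary Chinese restaurant process, and running the simulation with increasing `d` can be used to observe how the number of occupied tables tends to grow.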

Notice that the discount parameter, $d$, reduces the clustering effect. The number of previous
customers at each table is reduced by $d$, and this weight is instead placed on the probability of a new table. For
the limiting case with $d = 1$, $\theta_1, \ldots, \theta_n$ is an i.i.d. sample from $H$. On the other extreme, if
$d = 0$, then the result is a Dirichlet process. As Teh (2006) shows, the number of unique values
increases stochastically with both $d$ and $\alpha$. Recall that the stick-breaking process for the Dirichlet
process constructs class proportions by setting $\pi_k = \phi_k \prod_{r=1}^{k-1} (1 - \phi_r)$, where $\phi_k \sim \mathrm{Beta}(1, \alpha)$.

The Pitman-Yor process has a similar constructive definition using different beta marginals:
$\phi_k \sim \mathrm{Beta}(1 - d, \alpha + kd)$ instead. It produces heavier tails than the Dirichlet process: $\pi_k$ decreases
stochastically with $k$ for both processes, but this effect is more extreme with the Dirichlet process.
Pitman-Yor processes have been used in applications such as natural language processing (Teh, 2006;
Goldwater et al., 2006; Wallach et al., 2008) and image processing (Sudderth and Jordan, 2008).
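
The stick-breaking construction can be sketched directly. The code below draws the first few class proportions, with the $k$th stick fraction taken from a beta distribution with parameters $1 - d$ and $\alpha + kd$; the function name and the truncation level `n_sticks` are assumptions made for illustration.

```python
import random


def pitman_yor_sticks(n_sticks, alpha, d, rng):
    """Draw the first n_sticks class proportions and the leftover stick mass."""
    weights = []
    remaining = 1.0  # mass not yet assigned to any class
    for k in range(1, n_sticks + 1):
        phi = rng.betavariate(1.0 - d, alpha + k * d)
        weights.append(phi * remaining)
        remaining *= 1.0 - phi
    return weights, remaining
```

With `d = 0` every fraction is drawn from the same beta distribution, which is the Dirichlet process case. With `d > 0` the second beta parameter grows with `k`, so later fractions tend to be smaller and more mass is left for the tail.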

Due to the stick-breaking construction, strategies for using Dirichlet processes can be adapted
to Pitman-Yor processes. For example, it is straightforward to specify a hierarchical Pitman-Yor
process by analogy to the hierarchical Dirichlet process. Teh (2006) constructs an MCMC sampling
scheme for the hierarchical Pitman-Yor process, while Sudderth and Jordan (2008) develop a
variational inference algorithm. The variational function updates for the Pitman-Yor process are similar
to the Dirichlet process updates, since the stick-breaking proportions still have beta distributions.
An open problem is to develop a more efficient collapsed strategy that integrates over the class
proportions.


5.10 Conclusion 

Nonparametric mixtures have been an active area of research since Sethuraman (1994) provided 
the seminal stick-breaking representation of the Dirichlet process. The Dirichlet process mixture 
model and its extensions have been used in many domains for modeling a population with an un- 
bounded number of classes. The hierarchical Dirichlet process applies the same strategy for mixed 
membership models. Individual-level Dirichlet processes provide nonparametric mixtures for each 
individual, while a population-level Dirichlet process enables individuals to share statistical infor- 
mation. Such models have been used for survey analysis, document modeling, music models, and 
image analysis. 
Political scientists have long observed that members of the public do not tend to exhibit highly
constrained patterns of political beliefs and values in the way that partisan elites often do. In an- 
swering survey questions designed to measure latent ideology, they may act as if drawing responses 
randomly from different perspectives. In the American context, survey respondents frequently defy 
easy categorization as prototypical liberal or conservative, and yet their response patterns reflect 
structure that may be characterized by reference to such ideal types. We propose a mixed member- 
ship approach to survey-based measurement of ideology. Modeling survey respondents as partial 
members of a small number of ideological classes allows us to interpret the “mixed signals” they 
seem to send as a natural consequence of their competing inclinations. We illustrate our approach 
by reanalyzing data from a classic study of core beliefs and values (Feldman, 1988) and find that 
the most dramatic difference between prototypical members of the two main ideologies identified is 
not their vision of what society should be but rather their belief in what American society actually 
is. 
6.1 Introduction

A rich and important tradition in political science involves the analysis of patterns of political ide- 
ology. 1 Initially, the identification and characterization of such patterns had been performed in a 
rather ad-hoc manner, albeit based on careful philosophical and qualitative considerations within 
a theoretical framework established by leading scholars on the subject. This approach reached its 
peak with what many consider its best and most influential example, The American Voter, by
Campbell et al. (1960, see especially ch. 9, “Attitude Structure and the Problem of Ideology”). Beginning
with Converse (1964), a wave of critical rethinking on the subject emerged. Along the way, vari- 
ous researchers have applied modern empirical tools of survey analysis and statistical inference to 
revisit previously held assumptions about the structure of American political attitudes and beliefs 
(e.g., Marcus et al. (1974); Achen (1975); Stimson (1975); Feldman (1988); Conover and
Feldman (1984); Zaller (1992); Pew Research Center (2011); Ellis and Stimson (2012)). In essence, this
constitutes a measurement problem: we treat data, e.g., responses to survey questions, as manifest 
indicators of respondents’ latent dispositions on politics and policy. A number of different analytical 
tools have been utilized in approaching the problem, with factor analysis being the most frequently 
employed and item response models gaining in popularity. The basic goal of these endeavors is typ- 
ically to understand the structure of — or constraints on — beliefs, values, and attitudes at two levels: 
the individual and the population at large. Thus, the objects of interest include configurations of 
views that may be expected to coexist within a single person and the relative frequency with which 
these particular sets of views are held (or called upon in responding to survey items). 

A basic task in the study of ideology is the construction of typologies. Simple examples of
typologies are the well-known distinctions between “left” and “right” or “liberal” and “conservative.”
Types such as “liberal” or “conservative” correspond to particular configurations of attitudes, val- 
ues, and beliefs that hypothetical adherents to these ideological types are supposed to exhibit. Of 
course nothing precludes the construction of typologies with more than two classes, as long as they 
offer meaningful analytical distinctions. For example, in the most recent (as of 2012) of a series of 
reports from the Pew Center for People and the Press, survey response data, grouped using
clustering techniques, 2 revealed two distinct groups of Republican-leaning respondents (labeled post
facto as Staunch Conservatives and Main Street Republicans); three categories of people inclined
toward the Democratic party (New Coalition Democrats, Hard-Pressed Democrats, and Solid
Liberals); and three so-called “Middle Groups” (Libertarians, Disaffecteds, and Post-Moderns) (Pew
Research Center, 2011). 3 In previous reports in the Pew series, issued in 1987, 1994, 1999, and

1 Except where noted, we use the term ideology somewhat generically to include any patterns of political and
policy-oriented beliefs, values, and attitudes. Within political science, psychology, and the scholarly study of public opinion, the
term is more narrowly defined as a highly constrained special case of this.

2 The full Pew report provides little detail on the clustering procedure employed, indicating only that scales were
developed using factor analysis, with clustering carried out on responses measured along the resulting scales. The report does not
specify the particular clustering algorithm employed: “The typology groups are created using a statistical procedure called 
‘cluster analysis’ which accounts for respondents’ scores on all nine scales as well as party identification to sort them into 
relatively homogeneous groups.” Several competing cluster solutions were then compared, “evaluated for their effectiveness 
in producing cohesive groups that were sufficiently distinct from one another, large enough in size to be analytically practical, 
and substantively meaningful.” Other than citing the reliance on both statistical and substantive criteria, no additional detail 
is provided on the selection of the particular clustering appearing in the report. Aside from modeling assumptions implicit in 
the first-stage factor analysis, the overall approach is not model-based; certainly the clustering stage is model-free. 

3 Worth noting is that those identified as middle-groups were not necessarily neatly placed on a scale from liberal to 
conservative, as is typically done for “moderates” not clearly identifiable as liberal or conservative in contemporary terms.
For example, Libertarians were those who tended to express strong views in favor of reduced government in all aspects of 
life, social and economic, leading to positions more associated with Republicans on economic issues and with Democrats 
on a number of social issues including support for political secularism. Meanwhile, Disaffecteds would not be fruitfully 
described in terms of a continuum between liberal and conservative, as they were typified more by their cynicism regarding 
politics and voting in general, yet they did tend to be somewhat more likely to consider themselves Republicans. While 
2005, somewhat different typologies emerged from cluster analysis. It is unclear whether identical
clustering techniques have been used in all Pew typology studies. 

A straightforward means of employing typologies in describing ideological inclinations is to use 
the types as direct descriptions of individuals. This allows us to partition the population of interest 
into a few disjoint homogeneous subsets, whose members share the same configuration of attitudes, 
beliefs, and values — for example, “liberals” and “conservatives” — or perhaps a larger set, as in the 
Pew report. This approach, however, requires an important trade-off between interpretability and
the ability to account for individual heterogeneity. On the one hand, we want to be able to rely on
as few types as possible, each specifying meaningful distinctions pertaining to its members. On the 
other, we want to avoid oversimplifying the ideological phenomenon, thereby creating too coarse of 
a partition to adequately account for reality. 

A different approach, which we consider more natural for the study of political ideology, is the
use of a mixed membership (MM) framework. Mixed membership allows us to specify partial
membership in multiple reference types and to quantify the strength of these memberships. This 
enables us to describe ideology in terms of a reduced number of prototypical configurations, such as 
liberal and conservative, while allowing these configurations to coexist within a single individual. 
In this way we could, for instance, describe an individual’s ideology as a combination of “14% 
conservative and 86% liberal.” 

In the remainder of this introductory section, we consider the challenge of measuring individu- 
als’ ideologies or political belief systems, typical methods for handling the task, and what a mixed 
membership modeling approach may offer political scientists wishing to answer key problems in 
the study of ideology. We also reflect briefly upon the notion of individuals as partial adherents to 
more than one ideological profile and look a bit more closely at why MM models provide such a 
suitable empirical counterpart to this analytical framework. The data with which we illustrate an ap- 
plication of this measurement model are described in Section 6.2. Next, in Section 6.3, we present 
a general mixed membership model for ideology, which treats survey respondents as if they were 
drawing upon partial membership in different latent ideological prototypes (or extreme profiles) in 
order to determine their responses. Some details regarding model fit are offered in Section 6.4, after 
which we discuss our results in Section 6.5. Finally, we conclude with a brief examination of how 
scholars of political psychology and public opinion stand to benefit more broadly from a mixed 
membership approach to their investigations, and offer a candid assessment of the limitations of the 
current model and possible remedies to be pursued in future work. 

6.1.1 Understanding Ideology and the Structure of Politically-Oriented Beliefs,
Values, and Attitudes

Among scholars who wish to understand how members of the public reach evaluations about parties, 
policies, and candidates, or simply about what they see on the evening news, there are a variety of 
approaches that may be taken. What is common to most of these is an assumption that people 
have certain dispositions, outlooks, or “basic orientations” (Feldman, 1988) upon which they rely in 
making such evaluations. Important debates have revolved around the question of whether ordinary 
people seem to apply “abstract ideological principles, sweeping ideas about how government and 
society should be organized” (Kinder, 1983, p. 390) in order to reach opinions on a variety of 
issues. For some, the notion of ideology itself is inherently unidimensional, a “general left-right 
scheme. . .organizing a wide range of fairly disparate concerns” (Zaller, 1992, p. 26). In this narrow 
sense of “ideology,” as the term is typically employed by political scientists, the observation that 

the Disaffected category had been identified in all three previous reports, Post-Moderns, on the other hand, were a newly 
emergent type. Young and heavily Democratic in party membership, they agreed with staunch liberals on such issues as the 
environment, immigration, and separation of church and state, yet were more wary of New Deal and Great Society policies. 
(Pew Research Center, 2011, pp. 20-21). 
most people do not rely on such a unified structuring of political views and perceptions has long
been of great interest (Campbell et al., 1960; Converse, 1964). And yet, even though members of 
the mass public do not think about most political issues using a left-right scheme to nearly the 
extent that political elites do, they neither approach each new object of evaluation independently, 
nor do they think or care enough about politics and policy to be able to do so (Page and Shapiro, 
1992; Lupia and McCubbins, 1998). Thus, the basic orientations people use to make sense of the 
political world may be somewhat varied and not well captured by a single dimension, or possibly
even by a small number of dimensions.

We start from the widely shared assumption that such latent structure does exist in individuals 
and that it drives, probabilistically, their responses to survey questions, as proposed by Zaller (1992). 
It is this latent structure that we refer to here as ideology. As a prominent example of fundamental 
latent structures that do not meet the classical notion of a left-right overarching scale, some have 
suggested that particular nations or cultures have a few prominent core beliefs and values (e.g., 
communitarian or individualist orientations) that may have a high degree of popularity, but which 
may be of more or less importance in individuals’ psyches. Converse (1964, p. 211) posits that 
“psychological constraints” may be at play, whereby “a few crowning postures — like premises about 
survival of the fittest in the spirit of Social Darwinism — serve as a sort of glue to bind together many 
more specific attitudes and beliefs, and these postures are of prime centrality in the belief system as 
a whole.” 

Feldman (1988) and others follow up on this by examining the core beliefs and core values that 
may provide just this sort of psychological constraint. More recently, Ellis and Stimson (2012) make 
similar ideas the centerpiece of their “alternative conception of ‘ideology,’ . . . defined by citizens’ 
specific beliefs and values regarding what governments should and should not be doing.” This “oper- 
ational ideology” is distinguished from a “symbolic ideology” based on a person’s self-identification 
or one based on more vague sentiments about “ ‘government’ or ‘government programs’ broadly 
framed.” 

Our own conceptualization of ideology here follows that of Ellis and Stimson in its reliance 
on fundamental values and beliefs, especially as related to the appropriate role and obligations of 
government. Note that there is nothing inherent in such a definition that requires ideology to be 
unidimensional, although left-right orientation can certainly be a useful heuristic and is the focus of 
these authors’ own discussion. If ideology is instead conceptualized as the degree to which certain 
values and beliefs are salient for individuals as they evaluate political objects such as candidates 
and policy proposals, it may be less than ideal to measure ideology using a continuous, unbounded 
interval. For instance, we might expect that the vast majority of Americans will embrace values of 
reward for hard work, equality of opportunity, or freedom from governmental interference. Different 
kinds of people may find some of these considerations more compelling than others, but it would 
be surprising to find, for example, a group of Americans openly hostile to the notion of equal 
opportunity. Thus, we would like our measurement tools to be able to reflect this, by allowing us 
to distinguish groups of respondents not only by the values and beliefs which most starkly divide 
them, but also by how consistently they embrace those values and beliefs that are widely held. 

6.1.2 Measuring Ideology with Survey Data 

Regardless of the details of a latent structure approach to ideology (whether, for example, we treat 
latent and observed variables as continuously varying, ordinal, or measurable in terms of unordered 
levels), a key assumption is that all variation in survey responses can in fact be explained by the 
underlying latent structure. Survey responses will thus be conditionally independent given one’s 
ideology (i.e., belief/value structure). The matter of how to conceptualize the latent space, whether 
as a multidimensional continuum or a typology with multiple possible latent classes, is largely a 
pragmatic question about what best reveals an otherwise invisible structure to the researcher in a 
manner appropriate to the questions being asked. It may be that certain renderings of this space (as,
say, a unidimensional continuum) are rather limited in what they can tell us about how opinions are
generated, but a choice among different representations will quite properly hinge upon what best 
allows lucid communication of findings. A discrete multivariate approach to the survey responses 
themselves makes sense, since such a treatment reflects the actual structure of Likert scale items 
typically found in public opinion surveys, allowing the researcher to avoid the false assumption of 
a continuous scale and comparable units separating response levels. 

The main approaches one may take in measuring ideology as a latent construct, inferred from 
responses to carefully selected survey questions, can be divided into heuristic or purely descriptive 
techniques on one hand, and principled, model-based approaches on the other. Among the former 
are basic principal components analysis (PCA), 4 multidimensional scaling (MDS) (Marcus et al., 
1974), Q-analysis (Conover and Feldman, 1984), and correspondence analysis (CORA), a categor- 
ical analogue to PCA. The latter includes factor analysis (FA) (Feldman, 1988) and other forms of 
latent structure analysis such as latent trait analysis/item-response theory (IRT) (Treier and Hilly- 
gus, 2009) and latent class analysis (LCA) (Taylor, 1983; Feldman and Johnston, 2009), as well as 
mixed membership/Grade of Membership models (MM/GoM), which may be thought of as either a 
sort of discrete factor analysis (Erosheva, 2002, pp. 16-20) or an extension of latent class analysis. 

Although there are a number of different options for handling the measurement of ideology 
(and beliefs, attitudes, values, etc.), the most common approach is some form of factor analysis. 
As the oldest latent variable measurement technique, and the most deeply ingrained in the habits 
of social scientists, it has the advantage of being easily related to ordinary regression techniques, 
and dominates the early literature on mass belief structures. Converse sets the precedent of actually 
equating the factors discovered or confirmed via FA with dimensions of belief structure, generat- 
ing political evaluations much as Spearman’s general intelligence quotient g generates responses to 
IQ test items (Spearman, 1904). “Factor analysis is the statistical technique designed to reduce a 
number of correlated variables to a more limited set of organizing dimensions” (Converse, 1964,
our emphasis). One reason that factor analysis became the dominant approach to measuring latent 
ideological structure was that it was the earliest to be implemented in standard statistical computing 
packages. The representation of individuals’ ideals and beliefs located in a low-dimensional con- 
tinuous space also conformed well with evocative metaphors, adapted from economics, by which 
voters were considered to occupy a location in ideological space (or representing preferred tradeoffs 
among various competing public goods) and should be expected to prefer candidates located nearby 
(or with similar ideal balance among policy priorities) (Downs, 1957). Although the application of 
factor analysis in such situations is a deeply entrenched tradition in the study of ideology and public 
opinion, and is not an unreasonable approach, it is more appropriate for continuous multivariate data 
than discrete multivariate data typically found in survey responses. 

6.1.3 Citizens as Partial Adherents to Distinct Ideologies 

Converse (1964) set forth a research agenda, carried out in various forms over the decades since, 
aimed at understanding the “constraints” on patterns of belief which people may simultaneously 
hold. He refers to the “combinations” and “permutations” of “idea-elements” actually observed for 
individuals. When the constraints are severe enough, the resulting packet of beliefs to which a set 
of people subscribes is considered an ideology, in its strict sense as a psychological term of art. 
As Kinder (1983, p. 390) puts it, the notion of ideology upon which scholars once focused, but 
which guides the political evaluations of few actual citizens, consisted of “abstract ideological prin- 
ciples, sweeping ideas about how government and society should be organized.” In this previously

4 Social scientists regularly use the term “principal components analysis” interchangeably with (exploratory) factor
analysis — and PCA is treated as a special case of FA in statistical computer packages — but we are referring to its standard 
statistical meaning, a process by which orthogonal basis vectors of the reduced-dimensional space are chosen to maximize 
variance accounted for with each additional dimension included. The goal of PCA, as with the other descriptive approaches, 
is simply dimension reduction, not modeling or inference regarding the data generation process.
dominant understanding of ideology, answers to a variety of public opinion questions could be
thought to follow logically from a highly rigid, overarching outlook. Of course, if ideology operated 
as a purely deductive process among adherents, survey responses would be generated determinis- 
tically, and we should see certain beliefs and opinions always occurring together and others never 
co-occurring. 5

Given such a narrow view of ideology, it is easy to look at actual patterns of response as evidence 
that people are haphazard in their thinking about politics and policy. Zaller (1992) countered this 
by developing a highly influential theory of how individuals formulate their responses to public 
opinion polls by randomly sampling from a number of privately held “considerations” relevant to 
the question at hand. Such a formulation helps account for a number of puzzling observations, such 
as the tendency for particular individuals to give different answers to the same question on different 
occasions. 

Latent class statistical modeling (Lazarsfeld and Henry, 1968; Goodman, 1974) corresponds 
fairly well to Zaller’s theoretical model, as each member of a distinct class responds by drawing a 
particular response from a distribution associated with that type. However, such an approach has the 
intrinsic limitation of assuming that individuals belong exclusively to just one ideological group and 
that each such group is homogeneous. This characterization leaves out the possibility of individuals 
who do not fully conform to any of the categories of a typology, but rather respond as something of 
a hybrid. 

Mixed membership models offer a conceptually attractive way of overcoming this limitation. 
Under mixed membership analysis, we still try to identify and characterize typical ideological 
classes. However, we regard individuals not as full members of those classes, but as partial mem- 
bers. This way, we take individuals’ responses as arising from all the distributions associated with 
the classes, weighted according to individually specified membership in all of them. 

Mixed membership models help formalize the idea of people as being partial adherents to differ- 
ent recognizable ideologies. Some — especially political leaders or “elites” — may adhere vigorously 
to a particular ideology and this would be reflected by full or nearly full membership in one group 
to the exclusion of the others. Others, perhaps the vast majority of the mass public, will draw on 
more widely dispersed vectors of partial membership in each. This offers a nice compromise be- 
tween a continuous Euclidean latent space on one hand and a categorical latent space on the other; 
patterns in the population and in individuals themselves are described in terms of easily understood 
prototypical distributions over categorical responses, and yet individuals are treated as combina- 
tions of the various prototypes, with their partial memberships allowed to vary continuously. The 
generating process of responses may indeed be thought of hierarchically: in encountering a survey 
item, the respondent first randomly draws an ideological profile based upon his or her relative de- 
gree of membership in each extreme profile and then randomly draws a response from that profile’s 
response distribution. 


6.2 Application: The American National Election Survey 

The data we analyze here come from a pilot study for the 1984 American National Election Study 
(NES), conducted by the Center for Political Studies of the Institute for Social Research at the University
of Michigan during the summer of 1983. The study’s purpose was to introduce and test new
survey items, including a number of questions on core values that we will be using to illustrate 


5 Empirically, not only do people hold sets of beliefs and values that do not logically follow from one another, but
we simultaneously hold beliefs and values that are logically inconsistent; the commonly held trio of preferences for more
government spending, lower taxes, and a reduced deficit is but one prominent example.

our mixed membership modeling approach to measuring ideology. The complete data consist of
reinterviews with 314 randomly selected respondents to the earlier 1982 National Election Study. 
We reanalyze the same 19 items investigated by Feldman (1988), 6 using only the 279 complete 
responses. 

The initial national sample, obtained for the 1982 NES and from which the individuals in the 
1983 pilot study considered here were subsampled, consisted of 1,418 respondents living within 
the primary areas of the survey’s county-based sampling frame. These areas were all located within 
the 48 contiguous states (not including military bases). They include 12 major metropolitan areas, 
32 other standard metropolitan statistical areas, and 30 counties or county-groups representing the 
rural subpopulation. Stratification was implemented independently within each of the four major 
geographical regions of the United States, as recognized at the time: northeast, north central (loosely, 
the midwest), south, and west, with each represented in proportion to population. The population 
under study included only United States citizens 18 years or older on Election Day, 1982. 

According to Feldman’s review of the existing literature at the time, three political attitudes 
dominated the American political psyche: “belief in equality of opportunity, support for economic 
individualism, and support for the free enterprise system,” and the 19 items to be analyzed were all
developed with the intent to measure these components of ideology. We have labeled all of these 
with our own variable names, which are intended to capture the spirit of the questions and distin- 
guish similar items from one another based on the subtle wording differences. (See the Appendix 
for a complete list of the wording of questions.) Seven of the 19 items are intended to measure 
support for or belief in what Feldman calls “equal opportunity,” including statements that attribute 
inequality to inherent individual differences (natural inequality 1, natural inequality 2, and equality
goal misguided), one that claims a key role for society — perhaps understood as “government” by
some respondents — in ensuring equal opportunity for success (equal opportunity-society’s responsibility),
assessments of whether inequality is a serious problem (equal treatment and inequality big
problem), and one expressing support for the ideal of shared governance (democracy). The items
dealing with “economic individualism” are closely related to one another and differ mostly in sub- 
tle ways, as indicated in our choice of variable names: hard work optimism, hard work realism, 
hard work idealism, ambition pessimism, effort pessimism, and individual responsibility for failure. 
Finally, the “free enterprise” items (less intervention is better, intervention populism, laissez-faire 
capitalism, regulations not a threat to freedom, intervention causes problems, and free enterprise 
not intrinsic feature of gov’t) allow respondents to weigh possible tradeoffs between positive and 
negative consequences of governmental regulations. All items consist of statements to which the 
respondents may say that they “agree strongly,” “agree but not strongly,” “can’t decide,” “disagree 
but not strongly,” or “disagree strongly.” In order to avoid overparametrization for such a small sam- 
ple yet capture the main qualitative differences in responses, we collapse these responses into three 
categories: agree, can’t decide, or disagree. 
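The collapsing step can be sketched in a few lines. The integer codes 1 to 5 for the five original response options (in the order listed above) are an assumption made for illustration, as is the function name; the three output levels follow the chapter's labeling of agree, can't decide, and disagree as levels 1, 2, and 3.

```python
# Hypothetical coding of the original five-point items, in the order listed above:
# 1 = agree strongly, 2 = agree but not strongly, 3 = can't decide,
# 4 = disagree but not strongly, 5 = disagree strongly.
COLLAPSE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}  # -> 1 = agree, 2 = can't decide, 3 = disagree


def collapse_responses(raw_responses):
    """Map five-point responses to the three-level coding used in the analysis."""
    return [COLLAPSE[r] for r in raw_responses]


collapsed = collapse_responses([1, 2, 3, 4, 5])
```

Collapsing reduces the number of free parameters per item and extreme profile from four to two, which is the overparametrization concern raised above.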


6.3 Methods 

We apply a technique known as the Grade of Membership model (GoM) to the study of political 
ideology. GoM models (Woodbury et al., 1978; Manton et al., 1994; Erosheva et al., 2007) are a
sub-family of mixed membership models (Erosheva and Fienberg, 2005). They are well-suited to 
obtaining low-dimensional representations of high-dimensional multivariate unordered categorical 
data, such as those that are generated by opinion surveys. Similar to other MM techniques, GoM 

6 Feldman served on the ANES planning committee and was apparently directly involved with formulation of these 
questions, intended to measure three particular core values and beliefs of Americans.
models represent individuals as individually weighted combinations of a small number of “ideal
individuals” or “extreme profiles” and use the data to estimate both the extreme profiles themselves
and each subject’s membership structure. The Bayesian version of the GoM model, which we describe
here and use throughout this chapter, was introduced by Erosheva et al. (2007) and applied to
the study of disability in elders.

6.3.1 Grade of Membership Models 

We consider a sample of $N$ individuals. Each individual $i = 1, 2, \ldots, N$ has a corresponding $J$-dimensional
vector of manifest variables, $X_i = (X_{i1}, \ldots, X_{iJ})$, that collects the outcomes of interest.
We assume that the components of the outcome vector are unordered categorical variables with
$n_j$ levels each ($j = 1, \ldots, J$). In our application, these outcomes are the answers to each of the $J$
questions of the survey. For convenience, we label each component’s levels using consecutive numbers,
$X_{ij} \in \{1, 2, \ldots, n_j\}$. We assume that there is only one response vector per individual.

GoM models assume the existence of a specific number, K , of “extreme profiles” or “pure 
types.” These are idealized versions of individuals that we use as reference types for specifying the 
response distributions of actual individuals. We assume that real individuals are combinations of 
these extreme types. To formalize this, we endow each individual with his or her own membership
vector, $g_i = (g_{i1}, \ldots, g_{ik}, \ldots, g_{iK})$. Each component of $g_i$, $g_{ik}$ for $k = 1, \ldots, K$, specifies the
degree of membership of individual $i$ in the corresponding extreme profile among all $K$. We restrict
membership vectors so that $g_i \in \Delta_{K-1} = \{(g_1, \ldots, g_K) : g_k \geq 0, \sum_{k=1}^{K} g_k = 1\}$, where $\Delta_{K-1}$
is the $(K-1)$-dimensional simplex. Ideal individuals of the $k$th extreme profile have a membership
vector whose $k$th component is $g_{ik} = 1$ and the rest, zeros.

We characterize the extreme profiles as follows: For any individual that is a full member of the
$k$th extreme class (i.e., such that its membership vector has $g_{ik} = 1$ and $g_{ik'} = 0$ for $k' \neq k$), we
assume that the response distribution of the $j$th entry of the manifest variables vector is a simple
discrete distribution:

\[
\Pr(X_{ij} = l \mid g_{ik} = 1) = \lambda_{jk}(l), \tag{6.1}
\]

where $l \in \{1, 2, \ldots, n_j\}$ and $\lambda_{jk} = (\lambda_{jk}(1), \ldots, \lambda_{jk}(n_j)) \in \Delta_{n_j - 1}$.

For generic individuals with membership vector $g_i$, we characterize their component-wise response
distribution as the convex combination

\[
\Pr(X_{ij} = l \mid g_i) = \sum_{k=1}^{K} g_{ik} \lambda_{jk}(l).
\]

Geometrically, this specification means that the individual response distributions are located within 
the convex hull defined by the extreme profiles. 
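As a small numerical illustration of this convex combination, the sketch below mixes three extreme-profile distributions for a single item. The membership vector and the profile distributions are hypothetical values chosen for the example, not estimates from the data, and the function name is ours.

```python
def response_distribution(g_i, lam_j):
    """Return Pr(X_ij = l | g_i) for every level l of one item.

    g_i   : list of K membership weights summing to 1.
    lam_j : list of K lists; lam_j[k] is the response distribution of
            extreme profile k + 1 on this item.
    """
    n_levels = len(lam_j[0])
    return [
        sum(g_i[k] * lam_j[k][l] for k in range(len(g_i)))
        for l in range(n_levels)
    ]


# Hypothetical values: K = 3 profiles, 3 response levels (agree, can't decide, disagree).
lam_j = [
    [0.30, 0.01, 0.69],
    [0.88, 0.02, 0.10],
    [0.10, 0.80, 0.10],
]
g_i = [0.5, 0.4, 0.1]
p_i = response_distribution(g_i, lam_j)
```

Because the weights are nonnegative and sum to one, the result is itself a probability distribution, which is the convex-hull property stated above.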

We further assume that the responses to the $J$ items are conditionally independent given membership
vectors. This local independence assumption (Holland and Rosenbaum, 1986) expresses the idea
that the membership vector $g_i$ completely explains the dependence structure among the $J$
manifest variables. By making this assumption, we can construct the conditional joint distribution
of responses:

\[
\Pr(X_i = x_i \mid g_i) = \prod_{j=1}^{J} \sum_{k=1}^{K} g_{ik} \lambda_{jk}(x_{ij}).
\]

Assuming further that the individuals are randomly sampled from the population, we finally
obtain

\[
\Pr(X = x \mid g) = \prod_{i=1}^{N} \prod_{j=1}^{J} \sum_{k=1}^{K} g_{ik} \lambda_{jk}(x_{ij}). \tag{6.2}
\]
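The conditional likelihood in Equation (6.2) is a product over individuals and items of the mixed response probabilities, so in practice it is evaluated on the log scale. The following is a minimal sketch under assumed data layouts (nested lists, 1-based response levels); all numerical values are hypothetical.

```python
import math


def log_likelihood(x, g, lam):
    """Log of Pr(X = x | g) under local independence.

    x[i][j]   : observed level (1-based) of individual i on item j.
    g[i][k]   : membership of individual i in extreme profile k + 1.
    lam[j][k] : response distribution of profile k + 1 on item j.
    """
    total = 0.0
    for x_i, g_i in zip(x, g):
        for j, x_ij in enumerate(x_i):
            p = sum(g_ik * lam[j][k][x_ij - 1] for k, g_ik in enumerate(g_i))
            total += math.log(p)
    return total


# Hypothetical example: N = 2 individuals, J = 2 items, K = 2 profiles, 3 levels.
lam = [
    [[0.7, 0.1, 0.2], [0.2, 0.2, 0.6]],
    [[0.6, 0.1, 0.3], [0.1, 0.3, 0.6]],
]
x = [[1, 3], [2, 2]]
g = [[1.0, 0.0], [0.5, 0.5]]
ll = log_likelihood(x, g, lam)
```

The first individual is a full member of profile 1, so only that profile's probabilities enter; the second mixes both profiles equally.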

Membership vectors are unobserved latent quantities. In order to derive an unconditional expression
for the joint distribution of observed responses, we assume that membership vectors are
sampled from a common distribution, $G_\alpha$, with support in $\Delta_{K-1}$, whereby

\[
\Pr(X = x) = \prod_{i=1}^{N} \int_{\Delta_{K-1}} \prod_{j=1}^{J} \sum_{k=1}^{K} g_k \lambda_{jk}(x_{ij}) \, G_\alpha(dg). \tag{6.3}
\]

An interesting perspective on the GoM model results from considering the following equivalent
data generation process, which generates $N$ variates $x_i = (x_{i1}, \ldots, x_{iJ})$ for $i = 1, \ldots, N$,
according to a GoM model with $K$ extreme profiles (Haberman, 1995; Erosheva et al., 2007):


GoM Data Generation Process

For each $i = 1, 2, \ldots, N$:
    Sample $g_i = (g_{i1}, \ldots, g_{iK}) \sim G_\alpha$.
    For each $j = 1, 2, \ldots, J$:
        Sample $z_{ij} \sim \mathrm{Discrete}_{1:K}(g_{i1}, g_{i2}, \ldots, g_{iK})$;
        Sample $x_{ij} \sim \mathrm{Discrete}_{1:n_j}(\lambda_{j z_{ij}}(1), \ldots, \lambda_{j z_{ij}}(n_j))$.


According to this process, we can understand the generation of individual GoM variates as 
arising from a two-step procedure: (1) Given a membership vector $g_i$, we obtain the components
of the response vector one by one. (2) For each of the J components, we determine an effective 
extreme profile — which is allowed to vary from component to component — by sampling it with 
probabilities given by $g_i$. Next, we sample the actual response as if the individual were a full member
of that extreme profile for that question. The multiple membership is reflected by the fact that the 
individual answers to each question are generated according to different extreme profiles. 
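The generative process can be simulated directly. The sketch below assumes that the membership distribution is a Dirichlet with a fixed parameter vector (the choice made for the Bayesian specification in this chapter); the helper names and all parameter values are hypothetical and serve only to illustrate the two sampling steps.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_discrete(probs, rng):
    """Draw a 1-based level with the given probabilities."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]


def simulate_gom(n_individuals, alpha, lam, rng):
    """Simulate responses; lam[j][k] is profile k + 1's distribution on item j."""
    data = []
    for _ in range(n_individuals):
        g_i = sample_dirichlet(alpha, rng)      # membership vector of individual i
        x_i = []
        for lam_j in lam:
            z_ij = sample_discrete(g_i, rng)    # effective profile for this item
            x_i.append(sample_discrete(lam_j[z_ij - 1], rng))  # observed response
        data.append(x_i)
    return data


# Hypothetical parameters: K = 2 profiles, J = 2 items, 3 levels per item.
lam = [
    [[0.7, 0.1, 0.2], [0.2, 0.2, 0.6]],
    [[0.6, 0.1, 0.3], [0.1, 0.3, 0.6]],
]
data = simulate_gom(5, [0.4, 0.3], lam, random.Random(0))
```

Note that the effective profile is redrawn for every item, which is what distinguishes this process from a latent class model, where one class would be drawn once per individual.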

6.3.2 Full Bayesian Specification 

For this application we closely follow Erosheva et al. (2007). We complete the specification of the 
GoM in a full Bayesian fashion by choosing the distribution of membership vectors, G a , and a prior 
distribution for all parameters of our model. 

For the membership vectors, we specify their common distribution $G_\alpha$ as

\[
g_i \sim \mathrm{Dirichlet}(\alpha),
\]

with $\alpha = (\alpha_0 \cdot \xi_1, \ldots, \alpha_0 \cdot \xi_K)$, $\alpha_0 > 0$, and $\xi = (\xi_1, \ldots, \xi_K) \in \Delta_{K-1}$. Parameter $\xi$ is the
expected value of distribution $G_\alpha$. Using the generative process interpretation from the previous
section, each component of $\xi$ represents the expected proportion of item responses generated by
each of the extreme profiles; thus we can understand it, informally, as the relative importance of
each extreme profile in the population. Parameter $\alpha_0$ is a concentration parameter that expresses
how concentrated the probability distribution is about its expected value (as $\alpha_0$ increases) or near
the vertices of the simplex $\Delta_{K-1}$ (as $\alpha_0$ decreases). When $\alpha_0 = K$ and $\xi = (1/K, \ldots, 1/K)$, so
that $\alpha = 1_K$, distribution $G_\alpha$ becomes uniform over $\Delta_{K-1}$.

We specify the hyperpriors of $G_\alpha$ as $\alpha_0 \sim \mathrm{Gamma}(1, 2)$ (in shape/rate parametrization) and
$\xi \sim \mathrm{Dirichlet}(1_K)$. These choices specify a priori ignorance about the relative importance of each
extreme profile in the population and a slight preference for small values of
$\alpha_0$, with individuals likely to be relatively pure adherents to one or another extreme profile unless
the data suggest otherwise.

Each conditional response distribution for item $j$ and extreme profile $k$, $\lambda_{jk}(\cdot)$, consists of $n_j$
scalar parameters restricted to the simplex $\Delta_{n_j - 1}$. For these parameters, we choose the prior distribution

\[
\lambda_{jk} = (\lambda_{jk}(1), \ldots, \lambda_{jk}(n_j)) \sim \mathrm{Dirichlet}(1_{n_j}),
\]

or a uniform distribution over $\Delta_{n_j - 1}$.
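To make the prior specification concrete, the sketch below draws one set of parameters from it: the concentration parameter from a Gamma with shape 1 and rate 2, the profile weights from a flat Dirichlet, and each conditional response distribution from a flat Dirichlet. The function names are ours. Note that Python's `random.gammavariate` takes a scale rather than a rate, so a rate of 2 becomes a scale of 0.5.

```python
import random


def sample_flat_dirichlet(dim, rng):
    """Draw from Dirichlet(1, ..., 1), i.e. uniformly over the simplex."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_prior(n_profiles, levels_per_item, rng):
    """Draw (alpha_0, xi, alpha, lambda) from the priors described above."""
    alpha0 = rng.gammavariate(1.0, 0.5)            # shape 1, rate 2 -> scale 1/2
    xi = sample_flat_dirichlet(n_profiles, rng)    # relative importance of profiles
    alpha = [alpha0 * xi_k for xi_k in xi]         # Dirichlet parameter of G_alpha
    lam = [
        [sample_flat_dirichlet(n_j, rng) for _ in range(n_profiles)]
        for n_j in levels_per_item
    ]                                              # lam[j][k]: profile k + 1, item j
    return alpha0, xi, alpha, lam


# K = 3 profiles and J = 19 three-level items, as in the application.
alpha0, xi, alpha, lam = sample_prior(3, [3] * 19, random.Random(42))
```

Under this hyperprior the prior mean of the concentration parameter is 0.5, well below the value K that would make the membership distribution uniform, which is the stated preference for relatively pure membership.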


6.4 Fitting the Models 

We have employed an MCMC algorithm to obtain samples from the posterior distribution of param- 
eters given the data. The algorithm is an extension for multilevel variables of the sampler presented 
in Erosheva et al. (2007), originally developed for binary variables. This sampler is based on a 
data augmentation strategy using the equivalent generative process outlined in Section 6.3. We have 
fitted models with $K$ = 2, 3, 4, and 5 extreme profiles using the prior distributions described in
Section 6.3.2. 
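The data augmentation idea can be illustrated with the conditional updates that follow from the generative process and the conjugacy of the Dirichlet priors. This is a sketch of one plausible Gibbs sweep, not necessarily the sampler used here: the updates for the concentration parameter and the profile weights are not conjugate and are omitted, and all function names and values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def update_z(x_i, g_i, lam, rng):
    """Draw each z_ij with probability proportional to g_ik * lambda_jk(x_ij)."""
    profiles = range(1, len(g_i) + 1)
    z_i = []
    for j, x_ij in enumerate(x_i):
        weights = [g_ik * lam[j][k][x_ij - 1] for k, g_ik in enumerate(g_i)]
        z_i.append(rng.choices(profiles, weights=weights, k=1)[0])
    return z_i


def update_g(z_i, alpha, rng):
    """Draw g_i from Dirichlet(alpha + counts of items assigned to each profile)."""
    counts = [z_i.count(k) for k in range(1, len(alpha) + 1)]
    return sample_dirichlet([a + c for a, c in zip(alpha, counts)], rng)


def update_lambda_jk(x, z, j, k, n_levels, rng):
    """Draw lambda_jk from Dirichlet(1 + response counts among z_ij == k).

    j is a 0-based item index; k is a 1-based profile label.
    """
    counts = [0] * n_levels
    for x_i, z_i in zip(x, z):
        if z_i[j] == k:
            counts[x_i[j] - 1] += 1
    return sample_dirichlet([1 + c for c in counts], rng)


# One sweep on hypothetical data: N = 3, J = 2, K = 2, 3 levels per item.
rng = random.Random(1)
x = [[1, 3], [2, 2], [1, 1]]
alpha = [0.4, 0.3]
lam = [
    [[0.7, 0.1, 0.2], [0.2, 0.2, 0.6]],
    [[0.6, 0.1, 0.3], [0.1, 0.3, 0.6]],
]
g = [[0.5, 0.5] for _ in x]
z = [update_z(x_i, g_i, lam, rng) for x_i, g_i in zip(x, g)]
g = [update_g(z_i, alpha, rng) for z_i in z]
lam = [[update_lambda_jk(x, z, j, k, 3, rng) for k in (1, 2)] for j in range(2)]
```

Introducing the latent profile indicators is what makes the membership and response-distribution updates conjugate; without them the mixture sums in the likelihood would not factor.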

Similar to other latent structure models, GoM models are invariant to permutations of the ex- 
treme profile labels. For this reason we have re-labeled extreme profiles according to the decreasing 
sequence of the posterior estimates (posterior means) of the components of $\xi$. This ordering makes
comparisons easier. 
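The relabeling amounts to sorting the profiles by their estimated weight and applying the same permutation to every profile-indexed quantity. A minimal sketch follows, assuming posterior means are already available and that labels did not switch within the chain; the function name and the response distributions are hypothetical.

```python
def relabel_by_weight(xi_mean, lam_mean):
    """Reorder profiles by decreasing posterior mean weight.

    xi_mean  : list of K posterior means of the profile weights.
    lam_mean : lam_mean[j][k] is the estimated distribution of profile k + 1 on item j.
    Returns the permutation (0-based old indices) and the reordered quantities.
    """
    order = sorted(range(len(xi_mean)), key=lambda k: xi_mean[k], reverse=True)
    xi_sorted = [xi_mean[k] for k in order]
    lam_sorted = [[lam_j[k] for k in order] for lam_j in lam_mean]
    return order, xi_sorted, lam_sorted


# Hypothetical unordered output for K = 3 and a single item.
xi_mean = [0.023, 0.604, 0.373]
lam_mean = [[[0.1, 0.8, 0.1], [0.6, 0.0, 0.4], [0.9, 0.0, 0.1]]]
order, xi_sorted, lam_sorted = relabel_by_weight(xi_mean, lam_mean)
```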

Table 6.1 shows posterior estimates (posterior means and standard deviations) for the 
population-level distribution of membership vectors ($\alpha_0$ and $\xi$) for models with $K$ = 2, 3, 4, 5 extreme
profiles. In all cases, posterior estimates of $\alpha_0$ are relatively small. This causes most membership
vectors in the population to be dominated by a single extreme profile. However, $\alpha_0$ is large
enough so that the mixed membership becomes an important structural feature. For instance, for
model $K = 3$ the probability that a single individual’s responses to different questions are drawn
from more than one extreme profile is approximately 0.65. Not surprisingly, given the scarcity of 
data (n = 279), posterior dispersions are rather large. 
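A figure of this kind can be approximated by a plug-in calculation. Given a membership vector, all $J$ items are drawn from the same profile with probability $\sum_k g_k^J$; averaging over a Dirichlet membership distribution uses the moment $E[g_k^J] = \Gamma(\alpha_k + J)\Gamma(\alpha_0) / (\Gamma(\alpha_k)\Gamma(\alpha_0 + J))$. The sketch below plugs in the posterior means from Table 6.1 for $K = 3$ and $J = 19$ items; treating posterior means as fixed values is an assumption, and the reported figure may have been obtained differently (for instance by averaging over posterior draws).

```python
import math


def prob_single_profile(alpha0, xi, n_items):
    """Probability that all items of one individual come from the same profile.

    Assumes g ~ Dirichlet(alpha0 * xi) with xi summing to 1, and uses the
    closed-form Dirichlet moment E[g_k ** n_items] on the log scale.
    """
    total = 0.0
    for xi_k in xi:
        a_k = alpha0 * xi_k
        log_moment = (
            math.lgamma(a_k + n_items) - math.lgamma(a_k)
            + math.lgamma(alpha0) - math.lgamma(alpha0 + n_items)
        )
        total += math.exp(log_moment)
    return total


# Plug-in values: posterior means for K = 3 from Table 6.1, J = 19 items.
p_mixed = 1.0 - prob_single_profile(0.765, [0.604, 0.373, 0.023], 19)
```

With these inputs the result is close to the value of approximately 0.65 quoted above.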

Investigating the posterior estimates of $\xi$ we see that for $K \geq 3$, all models feature two dominant


K    $\alpha_0$       $\xi_1$          $\xi_2$          $\xi_3$          $\xi_4$          $\xi_5$
2    0.510 (0.236)    0.971 (0.009)    0.029 (0.009)
3    0.765 (0.233)    0.604 (0.129)    0.373 (0.129)    0.023 (0.007)
4    0.772 (0.238)    0.591 (0.131)    0.384 (0.131)    0.016 (0.010)    0.010 (0.010)
5    0.827 (0.229)    0.613 (0.122)    0.359 (0.121)    0.013 (0.010)    0.011 (0.011)    0.003 (0.003)

TABLE 6.1

Posterior estimates of parameters $\alpha_0$ and $\xi$ for models with $K$ = 2, 3, 4, and 5 extreme profiles.
Numbers in parentheses are posterior standard deviations.

FIGURE 6.1

Comparison of posterior estimates of the first extreme profile, $\lambda_{j1}(l)$, for model with $K = 3$ versus
models with $K = 4$ and $K = 5$ extreme profiles.


extreme profiles, with $\xi_1 \approx 0.60$, $\xi_2 \approx 0.36$, and $K - 2$ profiles with very small values of $\xi_k$. Closer
inspection reveals that those two dominant extreme profiles ($k = 1$ and $k = 2$) are very similar for
all models with $K \geq 3$. Plots in Figure 6.1 show the posterior estimates of $\lambda_{j1}(l)$ (parameters of
the first extreme profile) for the model with $K = 3$ extreme profiles versus their counterparts for
models with $K = 4$ and $K = 5$ extreme profiles. We see that all points lie almost perfectly on the
main diagonal. The situation is similar for the second extreme profile ($k = 2$, not shown).

Based on these observations and the qualitative inspection of the estimates, we have selected
the model with $K = 3$ for our inferences. This was the smallest model for which the two dominant
extreme profiles appear, and any more complex model basically gives us the same information,
supplemented only by additional extreme profiles with very small values of $\xi_k$. Interestingly, the
estimate of the conditional response distribution, $\lambda$, for the dominant extreme profile in model $K = 2$
is almost numerically equal to the weighted sum (by $\xi_1$ and $\xi_2$) of the estimates of the two dominant
extreme profiles of our selected model ($K = 3$, but also for $K = 4$ and $K = 5$). This suggests
that the two dominant extreme profiles in our chosen model are basically a split of the first dominant
profile of model $K = 2$. Our attempts to perform more formal evaluations failed to produce
anything illuminating. Posterior predictive counts were difficult to produce and analyze due to the
small sample size and large number of variables. We also evaluated the AIC Monte Carlo
(AICM) index (Raftery et al., 2007; Erosheva et al., 2007), which selected a model
with $K = 2$ extreme profiles.

6.5 Results and Discussion

6.5.1 Results 

The multinomial conditional probabilities of responses given full membership, $\lambda_{jk}$, define the extreme
profiles. Two of the three extreme profiles, k = 1,2, account for 98% of the item responses,
while profile k = 3 generates around 2%, and seems to be associated with a high probability of 
responding “can’t decide” (l = 2) to the survey items (estimated as anywhere from around 10% to 
83% for different items when k = 3). 

Analyzing the posterior membership estimates of the 279 respondents, the vast majority have 
partial membership of less than .01 in k = 3. Just five members of the population would answer 
survey questions as primarily a member of this class (from .71 to .91 membership), and only 23 of 
the 279 have greater than 2% membership in this “neutral” prototype. 

In order to better characterize the two dominant ideal types identified, let us examine each 
simply in terms of the probability of agreeing or disagreeing with each item when answers are based 
on considerations 7 rooted in one or the other dominant profile. For convenience — and in order to 
connect our findings with standard writings on American ideology — we use the term conservative 
as shorthand for k = 1 and liberal for k = 2. For reasons that will become apparent shortly, 
we might more accurately refer to these as, respectively, something like Individualist/Believers in 
Realized American Ideals and Social Responsibility-Oriented/Still Waiting for American Ideals to 
be Realized. Given the clumsiness of such labels, we will stick with the more common ideological 
identifiers, but consider them to be best understood in terms of response distributions associated 
with survey items, which we are about to examine. 

6.5.2 Analyzing the Extreme Profiles: Americans’ Core Values vs. Core Beliefs 

In considering the estimated response distributions $\lambda_k$ for the “conservative” (k = 1) and “liberal”
(k = 2) ideal types (Table 6.2), one important thing to notice is the presence of certain high-valence 
items, enjoying the consensus one might expect of core values shared by most members of a society. 
For such items, the distinction between liberals and conservatives is not especially stark, but to the 
extent that one type is more predictably supportive of a statement than the other, the differences 
are in the direction that would be expected. For instance, both prototypical respondents would be 
unlikely to agree that our inherent differences should lead us to give up on the goal of equality 
(j = 2, equality goal misguided), but the prototypical conservative may have a greater probability 
of breaking with the norm: $\lambda_{21}(1) = .272$ (sd = .045) as opposed to $\lambda_{22}(1) = .136$ (sd = .063)
for the prototypical liberal. Whether responding as a liberal or conservative, an individual would
very likely support the democratic ideal of governance by all sorts of people — not only the most
successful — ($\approx$ .94 or .86, respectively) as well as the notion that society has a responsibility to
ensure equal opportunity of success for all ($\approx$ .89 or .82, respectively). Yet, while a commitment to
the ideal of equal opportunity in personal and public life is widely embraced, so too is the recognition
that people are not equally well-suited to leadership positions (natural inequality 1 ($\approx$ .87
and .76) and natural inequality 2 ($\approx$ .95 and .85) among prototypical conservatives and liberals,
respectively).


7 Here we intentionally use the term considerations, from Zaller (1992) in order to emphasize the connection between 
our measurement strategy and Zaller’s theoretical framework. Just as Zaller depicts respondents drawing at random from 
an unobserved distribution of considerations in order to answer each question, we model such individuals as drawing an 
ideal type at random in proportion to their own latent membership vector, and then generating a response according to the 
distribution associated with the selected ideal type on the particular item.

$\lambda_{jk}(l)$
                                                    l = 1 (Agree)               l = 3 (Disagree)
 j  Question                                        k = 1         k = 2         k = 1         k = 2

 1  Equal treatment                                 0.61 (0.10)   0.92 (0.05)   0.37 (0.10)   0.07 (0.05)
 2  Equality goal misguided                         0.27 (0.05)   0.14 (0.06)   0.7 (0.05)    0.83 (0.06)
 3  Equal opportunity society’s responsibility      0.82 (0.05)   0.89 (0.05)   0.17 (0.05)   0.10 (0.05)
 4  Natural inequality 1                            0.87 (0.04)   0.76 (0.07)   0.12 (0.04)   0.22 (0.07)
 5  Natural inequality 2                            0.95 (0.02)   0.85 (0.06)   0.05 (0.02)   0.14 (0.05)
 6  Democracy                                       0.86 (0.04)   0.94 (0.04)   0.14 (0.04)   0.05 (0.04)
 7  Inequality big problem                          0.30 (0.14)   0.88 (0.07)   0.69 (0.14)   0.10 (0.07)

 8  Hard work optimism                              0.97 (0.02)   0.45 (0.17)   0.02 (0.02)   0.54 (0.17)
 9  Hard work realism                               0.12 (0.06)   0.47 (0.09)   0.87 (0.06)   0.52 (0.09)
10  Individual responsibility for failure           0.77 (0.06)   0.19 (0.12)   0.22 (0.06)   0.77 (0.12)
11  Ambition pessimism                              0.76 (0.05)   0.88 (0.05)   0.23 (0.05)   0.11 (0.05)
12  Hard work idealism                              0.64 (0.06)   0.21 (0.12)   0.35 (0.06)   0.78 (0.11)
13  Effort pessimism                                0.75 (0.07)   0.95 (0.03)   0.25 (0.07)   0.04 (0.03)

14  Less intervention is better                     0.81 (0.05)   0.42 (0.13)   0.17 (0.05)   0.55 (0.13)
15  Intervention populism                           0.62 (0.06)   0.83 (0.06)   0.36 (0.05)   0.11 (0.06)
16  Laissez-faire capitalism                        0.36 (0.05)   0.07 (0.07)   0.63 (0.05)   0.91 (0.07)
17  Regulations not a threat to freedom             0.33 (0.05)   0.49 (0.08)   0.66 (0.05)   0.49 (0.08)
18  Intervention causes problems                    0.94 (0.04)   0.58 (0.13)   0.05 (0.03)   0.40 (0.13)
19  Free enterprise not intrinsic feature of gov’t  0.12 (0.07)   0.41 (0.08)   0.87 (0.07)   0.58 (0.08)

TABLE 6.2

The two dominant extreme profiles for $K = 3$: Profile $k = 1$ (60.4% of responses) versus Profile
$k = 2$ (37.3% of responses). Numbers in parentheses are posterior standard deviations. The
grouping of items is based on Feldman (1988) and the original intent of the survey questionnaire
design: the first group concerns Equal Opportunity, the second, Economic Individualism, and the third,
Free Enterprise. The variable names, generic in the original, are our own.


In order to clarify which items are most important in defining each dominant extreme profile, 
we introduce the quantity 


\[
\mathrm{Cohes}_{jk} = \frac{\max\{\lambda_{jk}(l) : l = 1, \ldots, n_j\}}{\min\{\lambda_{jk}(l) : l = 1, \ldots, n_j\}}, \tag{6.4}
\]


or the cohesion of extreme profile k with respect to item j. The cohesion scores reflect the reliability 
with which each extreme type responds to an item. In Zaller’s (1992) theory of survey response 
to opinion polling, this might correspond to individuals tending to answer a question predictably, 
perhaps because nearly all relevant considerations lead to the same response. This may alternatively 
be thought to measure the cohesiveness of hypothetical adherents to each extreme profile. 
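
As a minimal sketch of how Equation (6.4) might be evaluated from MCMC output, the Python fragment below computes, for each posterior draw, the ratio of the largest to the smallest response probability and then averages over draws. The function name `cohesion_draws`, the array layout, and the simulated Dirichlet draws are illustrative assumptions, not the code or data behind this chapter. For an item with two response categories the ratio coincides with the odds of the top response.

```python
import numpy as np


def cohesion_draws(lam_draws):
    """Per-draw cohesion: largest over smallest response probability.

    lam_draws has shape (n_draws, n_categories); each row is one posterior
    draw of the response probabilities for a fixed item j and profile k.
    """
    lam_draws = np.asarray(lam_draws, dtype=float)
    return lam_draws.max(axis=1) / lam_draws.min(axis=1)


# Hypothetical posterior draws for one item and one profile
# (simulated for illustration; not the chapter's MCMC output).
rng = np.random.default_rng(0)
lam_draws = rng.dirichlet([30.0, 10.0], size=5000)

cohes = cohesion_draws(lam_draws)
posterior_mean_cohesion = cohes.mean()  # posterior mean of the ratio
print(round(float(posterior_mean_cohesion), 2))
```

Because the ratio is computed draw by draw before averaging, its posterior mean can be much larger than the ratio of posterior mean probabilities when the smallest probability is close to zero in some draws.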

Additionally, we consider the hypotheses 


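\[
DR_j : \; \operatorname*{arg\,max}_{l = 1, \ldots, n_j} \{\lambda_{j1}(l)\} \neq \operatorname*{arg\,max}_{l = 1, \ldots, n_j} \{\lambda_{j2}(l)\}, \tag{6.5}
\]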

for j = 1, ..., J. Hypothesis $DR_j$ states that full adherents to the two dominant profiles have
different modal responses to a given item j. Obtaining posterior estimates of the probability of
$DR_j$ enables us to draw inferences about how well different items distinguish the extreme profiles
from one another. For example, a posterior probability $\Pr[DR_j \mid \text{Data}]$ of 0.99 for item j = 7,
inequality big problem, means that, given our data on 279 respondents, we find only a 1% chance that
the most likely responses to the item by the two types of pure respondents are the same; it is, rather,
highly probable that the top response for liberals is recognition of inequality as a big problem while
conservatives are more likely than not to deny inequality as a persistent issue.
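
One way such a posterior probability could be approximated, sketched here under stated assumptions, is as the share of joint posterior draws in which the modal responses of the two profiles differ. The function name and the simulated draws below are hypothetical. In a real analysis the two arrays would hold draws from the same MCMC iterations, so that row m of each array belongs to the same joint draw.

```python
import numpy as np


def prob_different_modes(lam1_draws, lam2_draws):
    """Share of paired posterior draws in which the two profiles' modal responses differ."""
    mode1 = np.argmax(np.asarray(lam1_draws), axis=1)
    mode2 = np.argmax(np.asarray(lam2_draws), axis=1)
    return float(np.mean(mode1 != mode2))


# Hypothetical paired draws for one item; columns are (agree, disagree).
rng = np.random.default_rng(1)
n_draws = 4000
lam_cons = rng.dirichlet([6.0, 14.0], size=n_draws)  # profile k = 1
lam_lib = rng.dirichlet([18.0, 2.0], size=n_draws)   # profile k = 2

p_dr = prob_different_modes(lam_cons, lam_lib)
print(round(p_dr, 3))
```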

Table 6.3 shows our posterior estimates (posterior means) of $\mathrm{Cohes}_{jk}$ and our estimated
posterior probabilities of hypotheses $DR_j$, for extreme profiles k = 1 (conservative) and k = 2 (liberal),
and for every item in the survey (j = 1, ..., 19). Analyzing the cohesion scores we see that for certain
items, both dominant extreme profiles are predictable and give identical responses (e.g., intervention
causes problems); for others they reliably give opposite responses, i.e., one type is expected to
agree and the other to disagree with an item (e.g., inequality big problem); for still other items,
prototypical adherents to one profile are highly likely to give their modal response while prototypical
adherents of the other are far less predictable (e.g., laissez-faire capitalism).

For the first set of five items listed in Table 6.3, there is a greater than .50 posterior probability 
that pure liberals and pure conservatives will disagree in their favored responses. We can be highly 
confident (> .99) that prototypical liberals and conservatives — at least to the extent that these labels 
may be appropriately applied to k = 1 and k = 2 — will tend to disagree when it comes to their 
reactions to hard work idealism (“If people work hard, they almost always get what they want”), in- 
dividual responsibility for failure (“Most people who don’t get ahead should not blame the system; 
they really have only themselves to blame”), and inequality big problem (“One of the big problems 
in this country is that we don’t give everyone an equal chance”). Conservatives are also likely to 
disagree with liberals when it comes to the item most closely associated with contemporary defi- 
nitions of American liberalism and conservatism: less intervention is better (“The less government 
gets involved with business and the economy, the better off this country will be”). Similarly, we 
expect them to disagree on an item that perhaps best captures a quintessential American belief that 
hard work pays off: hard work optimism (“Any person who is willing to work hard has a good 
chance of succeeding”). For the remaining 14 items, the bulk of our posterior probability is placed 
on identical modal responses for liberals and conservatives, though they may differ substantially in 
how predictable they are in choosing this modal response. For example, while both are more likely 
to agree with the statement that “There are many goods and services that would never be available 
to ordinary people without governmental intervention” (intervention populism), the mean posterior
cohesion score for a prototypical liberal is 58, in contrast to around only 2 for a prototypical
conservative. For eight items, we can be virtually certain that both dominant extreme profiles share 
a modal response. 

If we look closely at the three items that most clearly distinguish our prototypical liberals from 
conservatives, two have been identified by Feldman (1988) as measures of the core belief in Eco- 
nomic Individualism and one as a measure of the belief in Equal Opportunity. All three, however, 
tap into beliefs about what is rather than what should be:

• Hard work idealism (j = 12): If people work hard, they almost always get what they want.

• Individual responsibility for failure (j = 10): Most people who don’t get ahead should not
blame the system; they really have only themselves to blame.

• Inequality big problem (j = 7): One of the big problems in this country is that we don’t give
everyone an equal chance.

Indeed, much of what seems to separate the response distributions for the two dominant ideal types 
has to do with how well respondents view the United States as actually living up to the ideals shared 
by many in both camps. In order to appreciate this, a distinction should be drawn between beliefs 
and values. 

According to Glynn et al. (1999), “Values are ideals. Beliefs represent our understanding of 
the way things are, but values represent our understanding of the way things should be” (p. 105). 
The difference between beliefs and values is not always well delineated and, in fact, some survey
questions may capture aspects of both. In certain cases, what is presented as a value may imply
some belief about the way things actually are, and this may affect the responses of some individuals 
surveyed. For example, while a majority of individuals answering from either principal extreme 
profile claim a belief that equal treatment leads to fewer problems (equal treatment), pure liberals
are nearly in uniform agreement with the statement, but conservatives have around a 38% chance of 
disagreeing with the sentiment. Why might there be resistance among conservatives, who otherwise 
generally embrace the goal of equality and the notion that society has a responsibility to ensure equal 
opportunity, according to their responses to other survey questions? Hidden within the question is an 
implied belief about the way things actually are: “If people were treated more equally in this country, 
we would have many fewer problems [than we have now],” with the bracketed words as implied
subtext. So if one believes that inequality leads to problems, but also that people already are treated 
equally and, perhaps, that commonly advocated programs aimed at the issue (e.g., affirmative action) 
are misguided, one might disagree with the survey item. Thus, one’s national pride and a tendency 
to view the nation as having already realized the ideals of equal opportunity are considerations 
of prototypical conservatives that have a non-trivial probability of being primed by the choice of 
wording here. 

Of the five items on which pure liberals and conservatives are expected to differ on their most 
likely responses, four address the locus of responsibility for individual success and failure. On all 
four measures, conservatives embrace the notion of individual responsibility for success and failure, 
while liberals are less convinced. Conservatives are unified in their belief that hard work has a “good 
chance” of yielding success, while liberals tend to disagree (albeit only at a 3:2 ratio). On a similar 
question, phrased differently, liberals largely reject an idealistic view of hard work, with a .79 prob- 
ability of disagreeing that it will “almost always” lead to satisfying results, while conservatives have 
a .63 probability of embracing such idealism. On one of the most divisive questions, prototypical conservatives
agree three to one that individuals should blame themselves if they “don’t get ahead,” while liberals 
find “the system” more at fault, by more than four to one! When it comes to the explicit assertion 
that inequality — specifically a lack of equal opportunity — remains a “big problem” in the United 
States, there is again a clear distinction between the two dominant types of respondents; liberals 
identify inequality as a big problem at over seven to one, while conservatives are nearly two to one 
in the opposite direction. 

In short, the results of our Grade of Membership analysis reveal hidden structure in the beliefs 
and values of survey respondents missing from the original factor analytic results in Feldman (1988). 
While several core values are widely embraced across extreme profiles, prototypical liberals tend 
to be more unified in their support of those ideals typically associated with them (equality and 
democratic principles), while prototypical conservatives tend to be more consistent in embracing 
values tied to their own central narratives (rewards of hard work, individual responsibility and self- 
reliance, and antipathy towards government intervention). Only a few survey items serve to starkly 
contrast the two dominant extreme profiles, and those that most clearly distinguish them involve 
beliefs about the United States in which they live rather than simply ideals about their nation as it 
could be.

                                                    Cohes_jk
 j  Question                                        k = 1 (cons)  k = 2 (lib)  DR_j

12  Hard work idealism                              1.95          22.29        1.00
10  Individual responsibility for failure           4.14          16.08        1.00
 7  Inequality big problem                          11.29         14.56        0.99
14  Less intervention is better                     5.79          19.11        0.66
 8  Hard work optimism                              330.92        8.62         0.61
17  Regulations not a threat to freedom             2.13          14.48        0.46
 9  Hard work realism                               29.55         7.43         0.36
18  Intervention causes problems                    67.89         11.28        0.27
 1  Equal treatment                                 1.89          7.35         0.16
19  Free enterprise not intrinsic feature of gov’t  70.77         58.83        0.14
15  Intervention populism                           1.79          58.46        0.02
16  Laissez-faire capitalism                        1.82          14.02        0.01
13  Effort pessimism                                3.31          21.71        0.00
11  Ambition pessimism                              3.53          5.60         0.00
 6  Democracy                                       6.90          5.89         0.00
 5  Natural inequality 2                            49.45         7.53         0.00
 4  Natural inequality 1                            8.18          32.55        0.00
 3  Equal opportunity society’s responsibility      5.15          55.54        0.00
 2  Equality goal misguided                         2.66          17.90        0.00


TABLE 6.3 

Extreme profiles for K = 3: Profile k = 1 (60.4% of responses) versus Profile k = 2 (37.3% of 
responses). Cohesion scores represent posterior means for the odds that a prototypical adherent will 
give the top response to the question.


6.6 Conclusion

Typologies are ubiquitous in political science, providing useful frameworks for understanding vari- 
ation in ideology, beliefs, and values, as well as conceptual development of many other areas (e.g., 
comparative political systems and conflicts, policy analysis, and identity politics). Typically, the 
manner in which such typologies are developed is ad hoc rather than model-based, reflecting a pri- 
ori qualitative judgments about how the political world is naturally partitioned. As we demonstrate, 
it is also possible to use a mixed membership approach to construct typologies from data in a princi- 
pled, model-based way, with qualitative interpretation taking place only after model-based estimates 
have been drawn, and without imposing a possibly artificially crisp partition on the population of 
interest. In some cases, the resulting typology will correspond well to what we expect and in others 
it might offer surprises. In our illustration, we see a bit of both: on one hand, two dominant ex- 
treme profiles emerge, which correspond roughly to what might be identified as conservatives and 
liberals, but on the other hand, the nature of these profiles presents us with a more nuanced view 
of what these ideal types look like. Analyzing survey response data with reference to such proto- 
types, we maintain a measure of simplicity that promotes understanding, while accommodating the 
heterogeneity actually present in real-world populations. 

Where some have previously analyzed public opinion in terms of types, they have used ad hoc 
clustering techniques (Pew Research Center, 2011) without justification for the particular algorithm
or verification that the results are robust to other choices of clustering routine. Latent class analysis is
a surprisingly underutilized technique in political science that improves upon this by assuming that
data are generated from distributions associated with distinct latent types of respondents (Feldman
and Johnston, 2009). Our mixed membership approach may be thought of as a generalization of
LCA, combining advantages of categorical data inference available in LCA with advantages of 
continuity assumptions from factor analysis. In fact, one way of thinking about what we are doing 
here is that we have taken what would be allocated to error in the case of LCA and have integrated 
it with the structural component of our model. In a latent class analysis, good model fit must reduce 
the “error” associated with individuals who mostly answer as if they belonged to one type, but 
respond anomalously on certain questions. Allowing for Grades of Membership in these classes 
(reconceptualized as ideal types) takes what would otherwise be considered noise and attributes it 
to a person’s internal complexity. 
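
The relationship to latent class analysis can be illustrated with a small sketch, assuming the standard GoM formulation in which an individual's response distribution on an item is the membership-weighted average of the extreme profiles' response distributions. The agree and disagree entries below are the Table 6.2 posterior means for item 7 (inequality big problem); the middle category is filled in as the remainder, and the third extreme profile is ignored, both of which are simplifying assumptions made only for this illustration.

```python
import numpy as np

# Rows: extreme profiles k = 1, 2; columns: (agree, other, disagree).
# "other" is the remainder after agree and disagree (an assumption).
lam = np.array([
    [0.30, 0.01, 0.69],  # k = 1
    [0.88, 0.02, 0.10],  # k = 2
])


def gom_response_probs(g, lam):
    """Response distribution on one item for an individual with membership vector g."""
    g = np.asarray(g, dtype=float)
    if np.any(g < 0) or not np.isclose(g.sum(), 1.0):
        raise ValueError("membership vector must lie in the unit simplex")
    return g @ lam


pure_k1 = gom_response_probs([1.0, 0.0], lam)  # vertex of the simplex: behaves like a latent class
mixed = gom_response_probs([0.75, 0.25], lam)  # partial member of both profiles
print(pure_k1, mixed)
```

When the membership vector sits at a vertex of the simplex, the individual responds exactly as a member of one latent class would; interior membership vectors produce the intermediate response distributions that a latent class model would have to treat as error.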

We see great potential for mixed membership modeling in the study of political psychology, be- 
havior, and public opinion. It offers a sort of compromise between the concreteness of classification 
by types and the flexibility of multidimensional continuous latent variables as in factor analysis. 
Informal and qualitative accounts of political behavior, for example, rely heavily on such classi- 
fications as the “likely voter,” the “independent voter,” and the “alienated working class voter”; 
for the most part, evocative labels such as these are replaced in quantitative analysis by measures 
further removed from the familiar and useful prototypes, but which utilize interval-level scores to 
reflect diversity across the population. Assuming that individuals manifest partial membership in 
multiple recognizable types lets researchers use prominent response patterns as familiar reference 
points without reducing people to stereotypes. Furthermore, it allows us to discover new ideal types, 
patterns that we might not have otherwise noticed. 

Mixed membership is a general idea that can be implemented and exploited in many ways. The 
particular technique that we employed in this application, the Grade of Membership model, has 
a fairly simple structure and is an appropriate tool for the basic soft clustering that we presented 
here. However, in order to investigate a wider array of political science research questions and to 
better use the available data, we need to develop more tools. First, we would like to investigate the 
relationship between individual ideology and other relevant variables, like cohort or income. This 
can be achieved by incorporating covariates into the model. One possible approach, introduced by
Manrique-Vallier (2010), is to specify the population-level membership distribution conditional on
the covariates, keeping the extreme profiles common to the whole population. Such an extension
would allow us to estimate the effect of the covariates on the membership composition of the
individuals, enabling us to answer questions such as: “Are older generations more conservative than
younger ones?”

Another useful direction would be to use individual estimates of membership as predictors for 
some dependent variable of interest, for example, one’s position on a particular policy issue, or reac- 
tion to an experimental intervention. This is a common approach with factor analysis, where we use 
scores on estimated dimension-reduced scales as inputs into a regression-style model of substantive 
interest. We can also understand mixed membership analysis as a dimension reduction technique: 
in our example we reduced the 19-dimensional response vector into a three-component membership
vector that lies in a two-dimensional unit simplex. Thus, performing a similar analysis with
membership vectors instead of factor loadings as inputs would achieve a similar aim, but make for a 
more intuitive analysis of results. For example, we could replace statements such as “For every stan- 
dard deviation increase on the economic individualism factor, we expect. . . ” — where the meaning 
of this factor is obscured — with more appealing statements of the form “an extra 25% conservatism 
leads to. . . ” The actual implementation of this idea carries some difficulties though. While it might 
be tempting to perform regular regression analysis conditional on posterior point estimates of the 
individual membership scores, we have to make sure to correctly reflect the inherent posterior un- 
certainty of these estimates in the regression. One possible approach is to set up comprehensive 
hierarchical models that include the mixed membership and the regression parts. Another possible 
approach is to obtain samples from the posterior distribution of individual membership vectors and 
use multiple imputation techniques (Rubin, 1987) to perform the combined analysis. 
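
The second approach can be sketched as follows, assuming that each posterior draw of the membership scores is treated as one imputed dataset, a regression is fit to each, and the results are pooled with Rubin's (1987) combining rules. Everything in the fragment is hypothetical: the outcome, the "posterior draws" (simulated here as noisy copies of a true score), and the helper names `ols` and `rubin_combine`. Only the sample size of 279 is taken from the text.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients and their estimated covariance matrix."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2 * xtx_inv


def rubin_combine(estimates, variances):
    """Pool M scalar estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    within = variances.mean()            # average within-draw variance
    between = estimates.var(ddof=1)      # variance of estimates across draws
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


rng = np.random.default_rng(2)
n, M = 279, 20
g_true = rng.beta(2, 2, size=n)                        # hypothetical conservatism share
y = 1.0 + 2.0 * g_true + rng.normal(0, 0.5, size=n)    # hypothetical outcome
g_draws = np.clip(g_true + rng.normal(0, 0.1, size=(M, n)), 0, 1)  # stand-in posterior draws

slopes, slope_vars = [], []
for g in g_draws:
    X = np.column_stack([np.ones(n), g])
    beta, cov = ols(X, y)
    slopes.append(beta[1])
    slope_vars.append(cov[1, 1])

slope, total_var = rubin_combine(slopes, slope_vars)
print(round(float(slope), 3), round(float(np.sqrt(total_var)), 3))
```

The pooled variance is never smaller than the average within-draw variance, which is how the uncertainty in the membership scores enters the regression inference.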

One limitation of GoM models, stemming from their simple, local independence structure, is 
that given membership, all answers to questions are taken to be essentially equivalent. However, 
researchers usually design and organize surveys so that questions belong to specific domains, such 
as “economic issues” or “social issues,” and therefore illuminate different (often known) aspects of 
ideology. One can envision a hierarchical extension in which, in addition to the mixed membership 
structure, questions are organized into domains and interact with the membership in different ways. 
The structure could be such that we take the original classification of questions as prior information 
with some degree of uncertainty, and learn the rest from the data. 

If we or others are to effectively extend mixed membership analysis in any of these potential 
directions, we would be well-advised to keep in mind the simple observation of Achen (1975), who 
reminds us: “The greater the distance from data to conclusions, the more opportunity for errors.” 
While latent variable modeling techniques grant us a principled way to measure underlying hidden 
concepts only indirectly revealed through survey responses, this typically comes at the expense of 
transparency; the connection between abstractions such as factor loadings and the observed data is 
often obscured in the minds of researchers and their audience. Among the various advantages of the 
MM/GoM approach to survey data in seeking to better understand the structure of mass attitudes, 
one of the most compelling is that it allows us the luxury of abstraction while preserving the close 
connection to data. 


Appendix: Survey Items 

Equal Opportunity 

• Equal treatment: If people were treated more equally in this country, we would have many fewer 
problems. (V2169/V3120)

• Equality goal misguided: We should give up on the goal of equality, since people are so different
to begin with. (V2172/V3122) 

• Equal opportunity-society’s responsibility: Our society should do whatever is necessary to make 
sure that everyone has an equal opportunity to succeed. (V2175/V3123) 

• Natural inequality 1: Some people are just better cut out than others for important positions in 
society. (V2178/V3121) 

• Natural inequality 2: Some people are better at running things and should be allowed to do so. 
(V2250) 

• Democracy: All kinds of people should have an equal say in running this country, not just those 
who are successful. (V2253, Not in wave 2) 

• Inequality big problem: One of the big problems in this country is that we don’t give everyone 
an equal chance. (V2256, V3125) 

Economic Individualism 

• Hard work optimism: Any person who is willing to work hard has a good chance of succeeding. 
(V2170) 

• Hard work realism: Hard work offers little guarantee of success. (V2173) 

• Individual responsibility for failure: Most people who don’t get ahead should not blame the 
system; they really have only themselves to blame. (V2176) 

• Ambition pessimism: Even if people are ambitious, they often cannot succeed. (V2251) 

• Hard work idealism: If people work hard, they almost always get what they want. (V2254) 

• Effort pessimism: Even if people try hard, they often cannot reach their goals. (V2257) 

Free Enterprise 

• Less intervention is better: The less government gets involved with business and the economy, 
the better off this country will be. (V2171) 

• Intervention populism: There are many goods and services that would never be available to 
ordinary people without governmental intervention. (V2174)

• Laissez-faire capitalism: There should be no government interference with business and trade. 
(V2177) 

• Regulations not a threat to freedom: Putting government regulations on business does not en- 
danger personal freedom. (V2252) 

• Intervention causes problems: Government intervention leads to too much red tape and too 
many problems. (V2255) 

• Free enterprise not intrinsic feature of gov’t: Contrary to what some people think, a free
enterprise system is not necessary for our form of government to survive. (V2258)


Evaluation of sensitivity and specificity of diagnostic tests in the absence of a gold standard
typically relies on latent structure models. For example, two extensions of latent class models in the
biostatistics literature, Gaussian random effects (Qu et al., 1996) and finite mixture (Albert and 
Dodd, 2004), form the basis of several recent approaches to estimating sensitivity and specificity 
of diagnostic tests when no (or partial) gold standard evaluation is available. These models attempt 
to account for additional item dependencies that cannot be explained with traditional latent class 
models, where the classes typically correspond to healthy and diseased individuals. 

We propose an alternative latent structure model, namely, the extended mixture Grade of Mem- 
bership (GoM) model, for evaluation of diagnostic tests without a gold standard. The extended 
mixture GoM model allows for test results to be dependent on latent degree of disease severity, 
while also allowing for the presence of some individuals with deterministic response patterns such 
as all-positive and all-negative test results. We formulate and estimate the model in a hierarchical 
Bayesian framework. We use a simulation study to compare recovery of true sensitivity and speci- 
ficity parameters with the extended mixture GoM model, and the latent class, Gaussian random 
effects, and finite mixture models. 

Our findings indicate that when the true generating model contains deterministic mixture
components and the sample size is large, all four models tend to underestimate sensitivity and
overestimate specificity parameters. These results emphasize the need for sensitivity analyses in real-life
applications when the data generating model is unknown. Employing a number of latent structure
models and examining how the assumptions on latent structure affect conclusions about accuracy of
diagnostic tests is a crucial step in analyzing test performance without a gold standard. We illustrate
the sensitivity analysis approach using data on screening for Chlamydia trachomatis. This example
demonstrates that the extended mixture GoM model not only provides us with new latent structure
and the corresponding interpretation to mechanisms that give rise to test results, but also provides
new insights for estimating test accuracy without a gold standard.


7.1 Introduction 

We consider the problem of estimating sensitivity and specificity of diagnostic or screening tests
when results are available from multiple fallible tests but not from a gold standard. This could happen
when a gold standard assessment does not exist or when economic or ethical issues in administering
the gold standard prevent one from doing so.

Latent class analysis (Lazarsfeld and Henry, 1968; Goodman, 1974) has been at the core of 
model-based methods for analyzing diagnostic errors in the absence of a gold standard (Hui and 
Zhou, 1998; Albert and Dodd, 2004; Pepe and Janes, 2007). Recently, two extensions of latent 
class models, known as Gaussian random effects (Qu et al., 1996) and finite mixture (Albert and 
Dodd, 2004), have produced several new approaches to estimating sensitivity and specificity of 
diagnostic tests when no (or partial) gold standard evaluation is available (Hadgu and Qu, 1998; 
Albert and Dodd, 2004; 2008; Albert, 2007b). See Hui and Zhou (1998) for a comprehensive review 
of the earlier literature on evaluating diagnostic tests without gold standards. Pepe and Janes (2007) 
criticize latent class models as a tool for analyzing diagnostic test performance because of the lack
of links between biological mechanisms giving rise to test results and dependencies induced by the
structure of the model. For example, they assert that most diseases are not dichotomous but occur in
varying degrees of severity. Hence, latent class models that employ discrete disease status as a latent
variable cannot account for additional correlations induced by disease severity such as occurrences
of false negatives for persons with mild disease. One example that Pepe and Janes (2007) provide
concerns detection of a particular substance in a biological sample where the amount of substance
affects all test results.

The best way to evaluate the performance of tests with unknown characteristics is, undoubtedly, 
to have at least a partial gold standard assessment (Albert and Dodd, 2008; Albert, 2007a). In the 
absence of a gold standard, however, having an arsenal of model-based methods can be informative 
for evaluating sensitivity of scientific conclusions regarding accuracy of diagnostic and screening 
tests. Because the true data generating mechanism is typically not known, Albert and Dodd (2004,
p. 433) recommend performing sensitivity analysis by using different models: “Although biological
plausibility may aid the practitioner in favoring one model over another, a range of estimates
from various models of diagnostic error (as well as standard errors) should be reported.”

We present an alternative latent structure model for the analysis of test performance when no 
gold standard is available. Our model is an extension of the Grade of Membership model. The 
GoM model employs a degree of disease severity as a latent variable, therefore inducing a mixed 
membership latent structure where individuals can be members of diseased and healthy classes at the 
same time. This type of latent structure addresses the concerns of Pepe and Janes (2007). We extend 
the GoM model to obtain the extended mixture GoM model, analogous to the extended finite mixture 
model by Muthen and Shedden (1999) and the finite mixture model by Albert and Dodd (2004). 
The extended mixture GoM model allows for a mixture of deterministic and mixed membership 
responses. For example, some truly positive individuals may have deterministic positive response on 
every test while others may be subject to diagnostic error according to the GoM model. A version of 
this model has previously been applied in disability studies (Erosheva et al., 2007), but the extended 
mixture GoM model is new to the literature on diagnostic testing. 

The remainder of the chapter is organized as follows. In Section 7.2 we review latent 
class (Lazarsfeld and Henry, 1968; Goodman, 1974), latent class random effects (also known as
Gaussian random effects) (Qu et al., 1996), and finite mixture models (Albert and Dodd, 2004)
that are commonly used for analysis of diagnostic and screening tests. In Section 7.3 we introduce 
the GoM model, develop the extended mixture GoM model that allows for deterministic responses, 
discuss a hierarchical Bayesian framework for estimation of model parameters, and derive sensi- 
tivity and specificity estimates. In Section 7.4 we conduct a simulation study examining recovery 
of specificity and sensitivity parameters under the latent class, the latent class random effects, the 
finite mixture, and the extended mixture GoM models. We investigate performance of each of these 
models when the true data-generating model is known, varying the true model among the four al- 
ternatives. Our findings further emphasize the need for sensitivity analyses when no gold standard 
is available. In Section 7.5 we illustrate such a sensitivity analysis using a publicly available dataset 
on screening for Chlamydia trachomatis (CT) (Hadgu and Qu, 1998). Finally, in Section 7.6 we
relate results from our analyses of simulated and real data to prior findings in the literature. 


7.2 Overview of Existing Model-Based Approaches 

Sensitivity and specificity are key accuracy parameters of diagnostic and screening tests. The gen- 
eral framework for estimating diagnostic errors without a gold standard starts by assuming a latent 
structure model and then deriving sensitivity and specificity parameters that correspond to the model 
formulation. This section introduces a common notation and presents a concise overview of latent 
class (Lazarsfeld and Henry, 1968; Goodman, 1974), latent class Gaussian random effects (Qu et al., 
1996), and finite mixture models (Albert and Dodd, 2004) that are commonly used for analysis of 
diagnostic and screening tests. For simplicity of the exposition, we omit the subject index. 

Let $x = (x_1, x_2, \ldots, x_J)$ be a vector of dichotomous variables, where $x_j$ takes on values
$l_j \in \mathcal{L}_j = \{0, 1\}$, $j = 1, 2, \ldots, J$. Let $\mathcal{X} = \prod_{j=1}^{J} \mathcal{L}_j$ be the set of all possible
outcomes $l$ for vector $x$. Denote a positive test result by $x_j = 1$ and a negative result by $x_j = 0$.
Denote the disease indicator by $\delta$, with $\delta = 1$ standing for the presence of the disease. Let
$r = P(\delta = 1)$ denote the disease prevalence parameter for the population of interest.

The latent class approach assumes two classes, the healthy and the sick. The probability of
observing response pattern $l$ is the weighted sum of the probabilities of observing $l$ from each
latent class:

\[
P(x = l) = P(x = l \mid \delta = 1) P(\delta = 1) + P(x = l \mid \delta = 0) P(\delta = 0), \quad l \in \mathcal{X}.
\]

The tests are assumed to be conditionally independent given the true disease status. Test result $x_j$
is a Bernoulli random variable with class conditional probabilities $\lambda_{1j} = P(x_j = 1 \mid \delta = 1)$ and
$\lambda_{2j} = P(x_j = 1 \mid \delta = 0)$ for a given true disease status. The conditional probabilities $\lambda_{1j}$, $\lambda_{2j}$,
$j = 1, \ldots, J$, and the weight $P(\delta = 1) = 1 - P(\delta = 0) = r$ are the model parameters.

For the jth diagnostic test, its sensitivity is the probability of the positive test result given that 
the true diagnosis is positive, P {x 3 = 1|<5 = 1), and its specificity is the probability of a negative 
response given that the true diagnosis is negative, P (xj = 0|(5 = 0) = 1 — P(xj = 1 1<5 = 0). The 
sensitivity and specificity of test j implied by the latent class model are then simply 

P(*i = 1|<5 = 1) = Ay 

and 

P ( [ Xj = 0|<5 = 0) = 1 - Ay. 
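
The latent class pattern probability above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `latent_class_pattern_prob`, the choice of three tests, and the parameter values are hypothetical and chosen for readability, not taken from any fitted model.

```python
from itertools import product

def latent_class_pattern_prob(pattern, tau, lam1, lam2):
    """P(x = l) under the two-class model with conditionally independent tests.

    pattern : tuple of 0/1 test results (l_1, ..., l_J)
    tau     : prevalence P(delta = 1)
    lam1    : list of P(x_j = 1 | delta = 1), one entry per test
    lam2    : list of P(x_j = 1 | delta = 0), one entry per test
    """
    p_sick, p_healthy = tau, 1.0 - tau
    for l_j, a, b in zip(pattern, lam1, lam2):
        p_sick *= a if l_j == 1 else 1.0 - a
        p_healthy *= b if l_j == 1 else 1.0 - b
    return p_sick + p_healthy

# Hypothetical illustration: J = 3 identical tests.
J = 3
tau, lam1, lam2 = 0.1, [0.9] * J, [0.05] * J
probs = {l: latent_class_pattern_prob(l, tau, lam1, lam2)
         for l in product((0, 1), repeat=J)}

sensitivity = lam1                        # P(x_j = 1 | delta = 1)
specificity = [1.0 - b for b in lam2]     # P(x_j = 0 | delta = 0)
```

The dictionary `probs` holds one probability per response pattern, and the probabilities sum to one over the $2^J$ patterns.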

The Gaussian random effects model of Qu et al. (1996) is an attempt to relax the assumption of
independence conditional on the true disease status. This model assumes that, given the disease status
$\delta$ and the random effect $b$, test outcomes are independent Bernoulli realizations with probabilities
given by the standard normal cdf $\Phi(\beta_{j\delta} + \sigma_\delta b)$, where $\beta_{j\delta}$, $\delta = 0, 1$;
$j = 1, \ldots, J$, are latent class parameters and $b$ is an individual-specific standard normal random
effect. Under this latent class Gaussian random effects model,

$$P(x = l \mid \delta) = \int \prod_{j=1}^{J} \Phi(\beta_{j\delta} + \sigma_\delta b)^{l_j} \left(1 - \Phi(\beta_{j\delta} + \sigma_\delta b)\right)^{1 - l_j} \varphi(b)\, db,$$

where $\varphi(b)$ is the standard normal density. The sensitivity and specificity for test $j$ under the latent
class Gaussian random effects model are then

$$P(x_j = 1 \mid \delta = 1) = \Phi\left(\frac{\beta_{j1}}{\sqrt{1 + \sigma_1^2}}\right)$$

and

$$P(x_j = 0 \mid \delta = 0) = 1 - \Phi\left(\frac{\beta_{j0}}{\sqrt{1 + \sigma_0^2}}\right),$$

respectively.
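
The closed-form sensitivity under the random effects model comes from integrating the probit probability over the standard normal random effect, using the identity $\int \Phi(\beta + \sigma b)\,\varphi(b)\,db = \Phi(\beta / \sqrt{1 + \sigma^2})$. The sketch below checks this numerically with a simple trapezoid rule. The function names and the integration settings are assumptions made for this illustration; the values $\beta_{j1} = 2.31$ and $\sigma_1 = 1.5$ are those used for the simulation design in Section 7.4.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(b):
    """Standard normal density."""
    return math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)

def marginal_positive_prob(beta, sigma, half_width=10.0, steps=20000):
    """Trapezoid-rule approximation of the integral of Phi(beta + sigma*b) * phi(b) db."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        b = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * std_normal_cdf(beta + sigma * b) * std_normal_pdf(b)
    return total * h

def closed_form(beta, sigma):
    """Phi(beta / sqrt(1 + sigma^2)), the marginal probability of a positive test."""
    return std_normal_cdf(beta / math.sqrt(1.0 + sigma ** 2))

beta_j1, sigma_1 = 2.31, 1.5
numeric = marginal_positive_prob(beta_j1, sigma_1)
exact = closed_form(beta_j1, sigma_1)
```

Both `numeric` and `exact` come out close to the target sensitivity of 0.9 used in the simulation study.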

The finite mixture model (Albert and Dodd, 2004) also uses the two-class structure as its basis
and adds two point masses for the all-zero and all-one response patterns. These point
masses correspond to the healthiest and the most severely diseased patients, who are always classified
correctly. Let $t$ be an indicator that denotes correct classification. Specifically, let $t = 0$ if a healthy
subject is always classified correctly (i.e., has the all-zero response pattern with $J$ tests), $t = 1$ if
a diseased subject is always classified correctly, and $t = 2$ otherwise. Thus, subjects are either
always classified correctly, when $t = 0$ or $t = 1$, or a diagnostic error is possible, when $t = 2$.
Denote the probabilities of always correctly classifying diseased and healthy subjects by
$\eta_1 = P(t = 1 \mid \delta = 1)$ and $\eta_0 = P(t = 0 \mid \delta = 0)$, respectively. Let also $w_j(\delta)$ denote the
probability of the $j$th test making a correct diagnosis when $t = 2$.

The finite mixture model of Albert and Dodd (2004) assumes that the test results $x_j$ are
independent Bernoulli random variables, conditional on the true disease status and the classification
indicator. Thus,

$$P(x_j = 1 \mid \delta, t) = \begin{cases} w_j(1), & \text{if } \delta = 1 \text{ and } t = 2, \\ 1, & \text{if } \delta = 1 \text{ and } t = 1, \\ 1 - w_j(0), & \text{if } \delta = 0 \text{ and } t = 2, \\ 0, & \text{if } \delta = 0 \text{ and } t = 0. \end{cases}$$

Note that $P(x_j = 1 \mid \delta = 1, t = 0) = P(x_j = 1 \mid \delta = 0, t = 1) = 0$. The sensitivity and specificity of
the $j$th test under the finite mixture model are then

$$P(x_j = 1 \mid \delta = 1) = \eta_1 + (1 - \eta_1)\, w_j(1)$$

and

$$P(x_j = 0 \mid \delta = 0) = \eta_0 + (1 - \eta_0)\, w_j(0),$$

respectively.
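
The two accuracy expressions for the finite mixture model follow from conditioning on the classification indicator. A short derivation for the sensitivity is given below, under the assumption (made for this sketch) that $\eta_1$ is read as the conditional probability $P(t = 1 \mid \delta = 1)$, so that $P(t = 2 \mid \delta = 1) = 1 - \eta_1$:

```latex
\begin{align*}
P(x_j = 1 \mid \delta = 1)
  &= P(x_j = 1 \mid \delta = 1, t = 1)\, P(t = 1 \mid \delta = 1)
   + P(x_j = 1 \mid \delta = 1, t = 2)\, P(t = 2 \mid \delta = 1) \\
  &= 1 \cdot \eta_1 + w_j(1)\,(1 - \eta_1).
\end{align*}
```

For instance, $\eta_1 = 0.5$ and $w_j(1) = 0.8$ give $0.5 + 0.5 \times 0.8 = 0.9$. The specificity follows in the same way; $\eta_0 = 0.2$ and $w_j(0) = 0.9375$ give $0.2 + 0.8 \times 0.9375 = 0.95$.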


7.3 A Mixed Membership Approach to Estimating Diagnostic Error 

The GoM model can be thought of as a different extension of latent class models where random
effects are individual-specific grades of membership (Erosheva, 2005). The extended mixture GoM
model combines individuals of mixed membership with those of full membership who have
predetermined response patterns. Although the extended mixture GoM model allows for an arbitrary
choice of the number and the nature of deterministic response patterns, it is reasonable to assume
two deterministic responses in the medical testing context. Analogous to the approach of Albert
and Dodd (2004), we use two deterministic components in the extended mixture GoM model to
allow for inclusion of some healthy and diseased individuals who have deterministic responses with
the all-zero and all-one patterns, respectively. However, to model the tests' diagnostic errors for the other
subjects in the population, we use the GoM model, whereas the finite mixture model of
Albert and Dodd (2004) relies on the two-class latent class model.

Next, we describe the GoM model before introducing the extended mixture GoM model and 
deriving a Bayesian estimation algorithm for the extension. 

7.3.1 The Grade of Membership Model 

As before, let $x = (x_1, x_2, \ldots, x_J)$ be a vector of dichotomous variables, where $x_j$ takes on values
$l_j \in \mathcal{L}_j = \{0, 1\}$, $j = 1, 2, \ldots, J$. Let $K$ be the number of mixture components (extreme profiles)
in the GoM model. To preserve generality, we will provide notation and estimation algorithms for
an arbitrary value of $K$. However, in the medical testing context, we will assume $K = 2$ to be
consistent with the existing literature.

Let $g = (g_1, g_2, \ldots, g_K)$ be a latent partial membership vector of $K$ nonnegative random
variables that sum to 1. In what follows, we use the notation $p(\cdot)$ to refer to both probability density and
probability mass functions. Each extreme profile is characterized by a vector of conditional response
probabilities for manifest variables, given that the $k$th component of the partial membership vector
is 1 and the others are zero, $\lambda_{kj} = p(x_j = 1 \mid g_k = 1)$, $k = 1, 2, \ldots, K$; $j = 1, 2, \ldots, J$.
Given partial membership vector $g \in [0, 1]^K$, the conditional distribution of manifest variable $x_j$ is
given by a convex combination of the extreme profiles' response probabilities, i.e.,
$p(x_j = 1 \mid g) = \sum_{k=1}^{K} g_k \lambda_{kj}$, $j = 1, 2, \ldots, J$. Let us denote the distribution of $g$ by $D(g)$.
The local independence assumption states that manifest variables are conditionally independent, given the
latent variables. Using this assumption and integrating out latent variable $g$, we obtain the marginal
distribution for response pattern $l$ in the form of a continuous mixture

$$p(x = l) = \int \prod_{j=1}^{J} \left( \sum_{k=1}^{K} g_k \lambda_{kj}^{l_j} (1 - \lambda_{kj})^{1 - l_j} \right) dD(g), \quad l \in \mathcal{X},$$

where $\mathcal{X} = \prod_{j=1}^{J} \mathcal{L}_j$ is the set of all possible outcomes for vector $x$.

The latent class representation of the GoM model leads naturally to a data augmentation
approach (Tanner, 1996). Denote by $x$ the matrix of observed responses $x_{ij}$ for all subjects. Let $\lambda$
denote the matrix of conditional response probabilities. Augment the observed data for each subject
with realizations of the latent classification variables $z_i = (z_{i1}, \ldots, z_{iJ})$. Denote by $z$ the matrix of
latent classifications $z_{ij}$. Let $z_{ijk} = 1$ if $z_{ij} = k$ and $z_{ijk} = 0$ otherwise.

We assume the distribution of membership scores is Dirichlet with parameters $\alpha$. The joint
probability model for the parameters and augmented data is

$$p(x, z, g, \lambda, \alpha) = p(\lambda, \alpha) \prod_{i=1}^{N} \left[ p(z_i \mid g_i)\, p(x_i \mid \lambda, z_i)\, \mathrm{Dir}(g_i \mid \alpha) \right],$$

where

$$p(z_i \mid g_i) = \prod_{j=1}^{J} \prod_{k=1}^{K} g_{ik}^{z_{ijk}}, \qquad p(x_i \mid \lambda, z_i) = \prod_{j=1}^{J} \prod_{k=1}^{K} \left( \lambda_{kj}^{x_{ij}} (1 - \lambda_{kj})^{1 - x_{ij}} \right)^{z_{ijk}},$$

and

$$\mathrm{Dir}(g_i \mid \alpha) = \frac{\Gamma\left(\sum_{k} \alpha_k\right)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}\, g_{i1}^{\alpha_1 - 1} \cdots g_{iK}^{\alpha_K - 1}.$$


We assume the prior on extreme profile response probabilities $\lambda$ is independent of the prior on the
hyperparameters $\alpha$. We further assume that the prior distribution of extreme profile response
probabilities treats items and extreme profiles as independent, hence
$p(\lambda, \alpha) = p(\alpha) \prod_{k=1}^{K} \prod_{j=1}^{J} p(\lambda_{kj})$.
We assume the prior $p(\lambda_{kj})$ is Beta(1, 1). Estimation of the GoM model can be done via a
Metropolis-Hastings within Gibbs algorithm as described by Erosheva (2003).
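
The continuous-mixture form of the GoM marginal distribution has no closed form in general, but it can be approximated by averaging over Dirichlet draws of the membership vector. The following Python sketch shows one way to do this by plain Monte Carlo. It is an illustration only, not the estimation algorithm of Erosheva (2003); the function names, the seed, the number of draws, and the parameter values ($K = 2$ profiles, $J = 3$ items) are hypothetical.

```python
import random
from itertools import product

def sample_dirichlet(alpha, rng):
    """Draw g ~ Dirichlet(alpha) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def gom_pattern_prob_given_g(pattern, g, lam):
    """prod_j sum_k g_k * lam[k][j]^l_j * (1 - lam[k][j])^(1 - l_j)."""
    prob = 1.0
    for j, l_j in enumerate(pattern):
        p_pos = sum(g_k * lam_k[j] for g_k, lam_k in zip(g, lam))
        prob *= p_pos if l_j == 1 else 1.0 - p_pos
    return prob

def gom_marginal_mc(lam, alpha, n_draws=2000, seed=0):
    """Monte Carlo estimate of p(x = l) for every response pattern l."""
    rng = random.Random(seed)
    n_items = len(lam[0])
    patterns = list(product((0, 1), repeat=n_items))
    totals = dict.fromkeys(patterns, 0.0)
    for _ in range(n_draws):
        g = sample_dirichlet(alpha, rng)
        for l in patterns:
            totals[l] += gom_pattern_prob_given_g(l, g, lam)
    return {l: s / n_draws for l, s in totals.items()}

# Hypothetical values: profile 1 is "mostly positive", profile 2 "mostly negative".
lam = [[0.7, 0.7, 0.7], [0.1, 0.1, 0.1]]
marginal = gom_marginal_mc(lam, alpha=(0.5, 1.5))
```

Because the pattern probabilities sum to one for every draw of $g$, the Monte Carlo estimates in `marginal` also sum to one up to floating-point error.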


7.3.2 The Extended Mixture GoM Model 

To define the extended mixture GoM model, we assume two patterns of deterministic responses that
correspond to the all-zero and all-one test results. Similar to Albert and Dodd (2004), we introduce
the classification indicator variable $t$. Let $t = 0$ for the healthiest individuals ($\delta = 0$) who are
always classified correctly; let $t = 1$ for the sick individuals ($\delta = 1$) who are always classified
correctly; and let $t = 2$ for the other individuals whose distribution of test results is given by the
GoM model with parameters $\alpha$, $\lambda$. Denote the respective weights for the multinomial distribution
of $t$ by $\theta = (\theta_0, \theta_1, \theta_2)$. The interpretation of $\theta_0$ and $\theta_1$ is similar to that of $\eta_0$ and $\eta_1$ in the finite
mixture model of Albert and Dodd (2004); we use different notation to emphasize
that the values of those parameters will be different due to differences between the models.

Note that parameter estimation for the extended mixture GoM model would be identical to the 
estimation for the standard GoM model if we could modify the observed counts for the all-zero 
and all-one responses by subtracting the numbers of individuals who are always classified correctly. 
However, these numbers are typically unknown, which means that we have to estimate the weights of
the deterministic components.

To derive the Markov chain Monte Carlo (MCMC) sampling algorithm for the extended mixture
GoM model, we further augment data with individual classification indicators. Let $N$ be the total
number of individuals in the sample, and let $n_0^{(m)}$ and $n_1^{(m)}$ be the expected values of the all-zero cell
count and the all-one cell count, respectively, for the mixed membership individuals (with $t = 2$) at
the $m$th iteration. Denote the number of individuals with at least one positive and at least one zero
response in their response pattern by $n_{mix}$. The total number of individuals with $t = 2$ at the $m$th
iteration is then $n_{GoM}^{(m)} = n_0^{(m)} + n_1^{(m)} + n_{mix}$. Let the prior distribution for weights $\theta$ be uniform
on the simplex and update $\theta$ at the end of the posterior step with:


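$$\theta_0^{(m+1)} = \theta_0^{(m)} + \frac{n_0^{(m)} - n_0^{(m+1)}}{N}, \qquad \theta_1^{(m+1)} = \theta_1^{(m)} + \frac{n_1^{(m)} - n_1^{(m+1)}}{N},$$

and

$$\theta_2^{(m+1)} = \frac{n_0^{(m+1)} + n_1^{(m+1)} + n_{mix}}{N} = 1 - \theta_0^{(m+1)} - \theta_1^{(m+1)}.$$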
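
The weight update can be sketched as a small function. This is a minimal illustration, reading the update as $\theta_0^{(m+1)} = \theta_0^{(m)} + (n_0^{(m)} - n_0^{(m+1)})/N$, the analogous expression for $\theta_1$, and $\theta_2^{(m+1)} = (n_0^{(m+1)} + n_1^{(m+1)} + n_{mix})/N$. The function name `update_theta` and all counts in the example are hypothetical.

```python
def update_theta(theta0, theta1, n0_old, n1_old, n0_new, n1_new, n_mix, n_total):
    """One update of the compartment weights (theta0, theta1, theta2).

    n0_*, n1_* : all-zero / all-one counts attributed to the GoM compartment
                 at iterations m (old) and m + 1 (new)
    n_mix      : number of subjects with mixed (not all-zero, not all-one) patterns
    n_total    : total sample size N
    """
    theta0_new = theta0 + (n0_old - n0_new) / n_total
    theta1_new = theta1 + (n1_old - n1_new) / n_total
    theta2_new = (n0_new + n1_new + n_mix) / n_total
    return theta0_new, theta1_new, theta2_new

# Hypothetical example: N = 1000 with 800 all-zero, 50 all-one, 150 mixed patterns.
# At iteration m, 30 all-zero and 10 all-one subjects are assigned to the GoM part,
# so theta0 = (800 - 30) / 1000 and theta1 = (50 - 10) / 1000.
theta_next = update_theta(theta0=0.77, theta1=0.04,
                          n0_old=30, n1_old=10, n0_new=40, n1_new=5,
                          n_mix=150, n_total=1000)
```

In this example the weights move to approximately (0.76, 0.045, 0.195), which still sum to one because every subject leaving a deterministic compartment enters the GoM compartment and vice versa.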


Given the number of individuals subject to classification error, the estimation of model
parameters for the stochastic GoM compartment is identical to that used in the case of the standard
GoM model. We use a reparameterization of $\alpha = (\alpha_1, \ldots, \alpha_K)$ with $\xi = (\xi_1, \ldots, \xi_K)$ and $\alpha_0$,
where $\alpha_0 = \sum_{k} \alpha_k$ and $\xi_k = \alpha_k / \alpha_0$, which reflect proportions of the item responses that belong
to each mixture category and the spread of the membership distribution. The closer $\alpha_0$ is to zero, the more
probability is concentrated near the mixture categories; similarly, the larger $\alpha_0$ is, the more probability
is concentrated near the population average membership score. We assume that $\alpha_0$ and $\xi$ are independent
since they govern two unrelated qualities of the distribution of the GoM scores. In the absence of a strong
prior opinion about hyperparameters $\alpha_0$ and $\xi$, we take the prior distribution $p(\xi)$ to be uniform on the
simplex and $p(\alpha_0)$ to be a proper diffuse gamma distribution. We also assume that the prior distribution
on the GoM scores is independent of the prior distribution on the structural parameters. The joint
distribution of the parameters and augmented data for the mixed membership component is


$$p(\lambda)\, p(\alpha_0)\, p(\xi) \left( \prod_{i=1}^{n_{GoM}} D(g_i \mid \alpha) \right) \left( \prod_{i=1}^{n_{GoM}} \prod_{j=1}^{J} \prod_{k=1}^{K} \left( g_{ik} \lambda_{kj}^{x_{ij}} (1 - \lambda_{kj})^{1 - x_{ij}} \right)^{z_{ijk}} \right),$$

where $z_{ijk}$ is the latent class indicator as before. We sample from the posterior distribution of
$\xi = (\xi_1, \ldots, \xi_K)$ and $\alpha_0$ by using the Gibbs sampler with two Metropolis-Hastings steps (see
Erosheva, 2003). The modified sampling algorithm for the extended mixture GoM model can be
easily generalized to a number of deterministic response patterns greater than two.
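
For readers who want to see what the data augmentation looks like in practice, the sketch below draws the latent classifications $z_{ij}$ and the membership vector $g_i$ for a single subject from their full conditionals. These conditionals are read off the joint distribution (a categorical draw with weights proportional to $g_{ik} \lambda_{kj}^{x_{ij}} (1 - \lambda_{kj})^{1 - x_{ij}}$, and a Dirichlet draw with parameters $\alpha_k + \sum_j z_{ijk}$). This is a sketch under those assumptions, not the C implementation used in this chapter; the function names and all numerical values are hypothetical, and the Metropolis-Hastings steps for $\alpha_0$ and $\xi$ are not shown.

```python
import random

def draw_z(x_i, g_i, lam, rng):
    """Draw z_ij for one subject; profiles are indexed 0, ..., K-1.

    P(z_ij = k) is proportional to g_ik * lam[k][j]^x_ij * (1 - lam[k][j])^(1 - x_ij).
    """
    z_i = []
    for j, x_ij in enumerate(x_i):
        weights = [g_k * (lam_k[j] if x_ij == 1 else 1.0 - lam_k[j])
                   for g_k, lam_k in zip(g_i, lam)]
        z_i.append(rng.choices(range(len(g_i)), weights=weights)[0])
    return z_i

def draw_g(z_i, alpha, rng):
    """Draw g_i ~ Dirichlet(alpha_k + number of items assigned to profile k)."""
    shapes = [a + z_i.count(k) for k, a in enumerate(alpha)]
    draws = [rng.gammavariate(s, 1.0) for s in shapes]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical single-subject example with K = 2 profiles and J = 6 tests.
rng = random.Random(1)
lam = [[0.7] * 6, [0.1] * 6]
x_i = (1, 1, 0, 1, 0, 1)
z_i = draw_z(x_i, g_i=[0.5, 0.5], lam=lam, rng=rng)
g_i = draw_g(z_i, alpha=(0.5, 1.5), rng=rng)
```

Iterating these two draws over subjects, together with updates for $\lambda$, $\alpha_0$, and $\xi$, gives the general shape of a Metropolis-Hastings within Gibbs scheme.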


7.3.3 Sensitivity and Specificity with the Extended Mixture GoM Model 


Here we derive sensitivity and specificity estimates under the extended mixture GoM model. As
before, let us denote the true diagnosis of a subject by $\delta = 1$ or $\delta = 0$ for the presence or absence of
the disease, respectively. If patient $i$ has the disease, then, under the extended mixture GoM model,
this patient either belongs to the deterministic compartment with the clear positive diagnosis, $t = 1$,
or belongs to the stochastic compartment with the classification indicator $t = 2$. Within the stochastic
compartment, the probability of disease is the mean membership in the first extreme profile,
$P(\delta = 1 \mid t = 2) = \xi_1$. In terms of probability, this translates into $P(\delta = 1) = \theta_1 + \theta_2 \xi_1$. As a
consequence, the sensitivity of item $j$ can be expressed as follows:


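$$P(x_j = 1 \mid \delta = 1) = P(x_j = 1, t = 1 \mid \delta = 1) + P(x_j = 1, t = 2 \mid \delta = 1) = \left[ P(t = 1) + P(x_j = 1, t = 2, \delta = 1) \right] / P(\delta = 1).$$

Noticing that

$$P(x_j = 1, t = 2, \delta = 1) = P(x_j = 1 \mid t = 2, \delta = 1)\, P(\delta = 1 \mid t = 2)\, P(t = 2),$$

we obtain a parametric form for sensitivity of item $j$ under the extended mixture GoM model:

$$P(x_j = 1 \mid \delta = 1) = \frac{\theta_1}{\theta_1 + \theta_2 \xi_1} + \lambda_{1j}\, \frac{\theta_2 \xi_1}{\theta_1 + \theta_2 \xi_1}.$$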


Similarly, the absence of the disease for patient $i$ means that either $i$ belongs to the deterministic
compartment, $t = 0$, of a clear negative diagnosis, or belongs to the stochastic compartment
with classification indicator $t = 2$. Therefore, $P(\delta = 0) = \theta_0 + \theta_2 \xi_2$ and we obtain

$$P(x_j = 1 \mid \delta = 0) = P(x_j = 1, t = 0 \mid \delta = 0) + P(x_j = 1, t = 2 \mid \delta = 0)$$

$$= P(x_j = 1, t = 0 \mid \delta = 0) + P(x_j = 1, t = 2, \delta = 0) / P(\delta = 0).$$

Because $P(x_j = 1, t = 0 \mid \delta = 0) = 0$, we have

$$P(x_j = 1 \mid \delta = 0) = \lambda_{2j}\, \frac{\theta_2 \xi_2}{\theta_0 + \theta_2 \xi_2},$$

and the specificity estimate of item $j$ under the extended mixture GoM model can be obtained as
$1 - P(x_j = 1 \mid \delta = 0)$.
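
The two closed-form expressions can be collected in a small helper. The sketch below is an illustration with a hypothetical function name, `extm_gom_accuracy`. The example values are those of the extended mixture GoM simulation design in Section 7.4, with $\xi$ computed from $\alpha = (0.02, 0.06)$ as $\xi_k = \alpha_k / (\alpha_1 + \alpha_2)$.

```python
def extm_gom_accuracy(theta, xi, lam1_j, lam2_j):
    """Sensitivity and specificity of item j under the extended mixture GoM model.

    theta  : (theta0, theta1, theta2), weights of the three compartments
    xi     : (xi1, xi2), mean memberships in the diseased and healthy profiles
    lam1_j : P(x_j = 1) for the diseased extreme profile
    lam2_j : P(x_j = 1) for the healthy extreme profile
    """
    theta0, theta1, theta2 = theta
    xi1, xi2 = xi
    p_diseased = theta1 + theta2 * xi1       # P(delta = 1)
    p_healthy = theta0 + theta2 * xi2        # P(delta = 0)
    sensitivity = (theta1 + lam1_j * theta2 * xi1) / p_diseased
    false_positive = lam2_j * theta2 * xi2 / p_healthy
    return sensitivity, 1.0 - false_positive

alpha = (0.02, 0.06)
xi = tuple(a / sum(alpha) for a in alpha)    # (0.25, 0.75)
sens, spec = extm_gom_accuracy((0.85, 0.05, 0.10), xi, lam1_j=0.7, lam2_j=0.6)
```

With these inputs the sensitivity is 0.9 and the specificity is approximately 0.951, matching the target accuracy values of the simulation design.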


7.4 Simulation Study 

In this section, we present a simulation study with the primary aim of examining recovery of sensitivity
and specificity parameters under the four different latent structure models: the latent class
(Lazarsfeld and Henry, 1968; Goodman, 1974), the latent class random effects (Qu et al., 1996), the finite
mixture (Albert and Dodd, 2004), and the extended mixture GoM models introduced earlier. We
investigate performance of each of these models when the true model is known, varying the true
model among the four alternatives under two sample sizes, $N = 1000$ and $N = 4000$. We also
report the comparative fit of the models in each case; however, model fit was not a primary goal
of our study. Earlier work demonstrated difficulties in distinguishing between models with
different dependence structures (Albert and Dodd, 2004), and pointed out that even equally well-fitting
models may result in different accuracy estimates (Begg and Metz, 1990).

For the simulations we considered $J = 6$ and set the true specificities and sensitivities to be the
same for all six items. Specifically, we used the value of 0.9 for the sensitivity parameter and 0.95 for
the specificity. This setting allowed us to examine the recovery of accuracy parameters for a given
model by simply computing the respective average sensitivity and specificity estimates across all
items. The hyperparameter was set at $\alpha_0 = 0.25$.

We selected data generating designs under each model to reflect important features of biomedical
screening and diagnostic data. Most noticeably, contingency tables formed on the basis of this
type of data often contain many zeros and small observed cell counts but also have several large cell
counts. The large observed cell counts typically include the all-zero and the all-one response
patterns. The following parameter choices produced simulated data with many zeros and large all-zero
and all-one counts and items with 0.9 sensitivity and 0.95 specificity:

1. For data generated under the latent class model, we chose: $\tau = 0.1$ and $\lambda_{1j} = 0.9$, $\lambda_{2j} = 0.05$,
for all $j$.

2. For data generated under the latent class random effects model, we chose: $\sigma_0 = \sigma_1 = 1.5$,
$\tau = 0.1$, and $\beta_{j0} = -2.965$, $\beta_{j1} = 2.31$, for all $j$.

3. For data generated under the finite mixture model, we chose: $\tau = 0.1$, $\eta_0 = 0.2$, $\eta_1 = 0.5$, and
$w_j(0) = 0.9375$, $w_j(1) = 0.8$, for all $j$.

4. For data generated under the extended mixture GoM model, we chose the
following parameter values: $\theta = (0.85, 0.05, 0.10)$, $\alpha = (0.02, 0.06)$, and $\lambda_{1j} = 0.7$, $\lambda_{2j} = 0.6$,
for all $j$.
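
To illustrate the kind of sparse contingency table these designs produce, the sketch below simulates data from the first design (the latent class model with $\tau = 0.1$, $\lambda_{1j} = 0.9$, $\lambda_{2j} = 0.05$) and tabulates the response patterns. The function name, the random seed, and the use of Python's standard `random` module are assumptions for this illustration and are unrelated to the software used in the study.

```python
import random
from collections import Counter

def simulate_latent_class(n_subjects, tau, lam1, lam2, seed=0):
    """Simulate response patterns from the two-class latent class model."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_subjects):
        diseased = rng.random() < tau              # latent disease status
        probs = lam1 if diseased else lam2         # class-conditional probabilities
        pattern = tuple(int(rng.random() < p) for p in probs)
        counts[pattern] += 1
    return counts

J = 6
counts = simulate_latent_class(n_subjects=1000, tau=0.1,
                               lam1=[0.9] * J, lam2=[0.05] * J)
all_zero, all_one = (0,) * J, (1,) * J
n_empty_cells = 2 ** J - len(counts)     # patterns never observed
largest_pattern = counts.most_common(1)[0][0]
```

Under this design the all-zero pattern has probability of about 0.66, so it dominates the table, while many of the 64 cells are expected to stay empty at $N = 1000$.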

Among the four models, the latent class model is the least complex with 13 independent parameters; the
latent class random effects and the finite mixture models both have 15 independent parameters, and
the extended mixture GoM model is the most complex with 16 independent parameters. We used
BUGS (Bayesian inference using Gibbs sampling) to estimate the latent class, latent class random
effects, and finite mixture models. We used C code to estimate the extended mixture GoM
model.
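
The parameter counts quoted above can be reproduced by simple bookkeeping for $J = 6$ items and $K = 2$ classes or profiles. The breakdown in the sketch below (which parameters are counted for each model) is an interpretation made for this illustration; it is consistent with the totals of 13, 15, 15, and 16, and with the 14 parameters and the degrees of freedom reported for the standard GoM model and the other models in Table 7.5.

```python
J, K = 6, 2

n_params = {
    "LCM": K * J + 1,           # lambda_kj and the prevalence tau
    "LCRE": K * J + 1 + 2,      # beta_j0, beta_j1, tau, sigma_0, sigma_1
    "FM": K * J + 1 + 2,        # w_j(0), w_j(1), tau, eta_0, eta_1
    "GoM": K * J + K,           # lambda_kj and alpha_1, ..., alpha_K
    "ExtM-GoM": K * J + K + 2,  # GoM parameters plus two free weights in theta
}

# A 2^J table has 2^J - 1 free cell probabilities.
df = {model: 2 ** J - 1 - n for model, n in n_params.items()}
```

For example, `df["LCM"]` is 63 - 13 = 50 and `df["ExtM-GoM"]` is 63 - 16 = 47.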

Tables 7.1-7.4 report posterior means and standard errors of the sensitivity and specificity
parameters, averaged over the six items, for each model, together with the value of the log-likelihood
and goodness-of-fit criteria. We report the log-likelihood, the $G^2$ likelihood ratio criterion (Bishop et al.,
1975), and the truncated sum of squared Pearson residuals (SSPR) (Erosheva et al., 2007)
computed for observed counts of at least 1 (i.e., the sum did not include residuals for the cells with zero
observed counts). The log-likelihood and the goodness-of-fit criteria were evaluated at the posterior
means of the parameters for each model.
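
The two goodness-of-fit criteria can be computed from observed and expected cell counts as sketched below. The sketch assumes the usual form of the likelihood ratio statistic, $G^2 = 2 \sum_c O_c \log(O_c / E_c)$ over cells with positive observed counts, and a truncated Pearson sum restricted to cells with at least one observation. The function names and the three-cell toy table are hypothetical.

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio statistic; cells with zero observed count contribute 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def truncated_sspr(observed, expected, min_count=1):
    """Sum of squared Pearson residuals over cells with observed count >= min_count."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if o >= min_count)

# Toy table with three cells (hypothetical counts; totals agree at 15).
observed = [10, 0, 5]
expected = [8.0, 2.0, 5.0]
g2 = g_squared(observed, expected)
sspr = truncated_sspr(observed, expected)
```

In the toy table the empty cell is skipped by the truncated sum, so `sspr` equals $(10 - 8)^2 / 8 = 0.5$, and `g2` equals $20 \log 1.25$.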
TABLE 7.1
Results for the LCM generating model.

N = 1000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    23.44            22.68            18.47            22.14
$G^2$              30.24            31.11            34.59            54.54
Log-likelihood     -1528.19         -1528.62         -1530.36         -1545.39
Sensitivity        0.894 (0.032)    0.890 (0.033)    0.893 (0.032)    0.910 (0.029)
Specificity        0.951 (0.007)    0.951 (0.007)    0.950 (0.007)    0.953 (0.006)

N = 4000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    49.42            44.61            40.90            178.23
$G^2$              55.73            61.70            56.19            280.96
Log-likelihood     -6418.15         -6421.13         -6418.38         -6525.79
Sensitivity        0.902 (0.015)    0.895 (0.016)    0.901 (0.016)    0.913 (0.009)
Specificity        0.948 (0.004)    0.949 (0.004)    0.948 (0.004)    0.953 (0.005)


TABLE 7.2
Results for the LCRE generating model.

N = 1000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    308.75           54.98            36.45            34.55
$G^2$              232.28           62.60            62.02            59.57
Log-likelihood     -1301.07         -1216.24         -1215.95         -1221.95
Sensitivity        0.838 (0.031)    0.886 (0.040)    0.878 (0.039)    0.905 (0.010)
Specificity        0.969 (0.005)    0.958 (0.009)    0.971 (0.007)    0.976 (0.004)

N = 4000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    710.79           60.92            65.74            61.13
$G^2$              514.76           73.25            76.32            62.11
Log-likelihood     -5283.37         -5062.62         -5064.15         -5069.51
Sensitivity        0.868 (0.016)    0.938 (0.013)    0.876 (0.019)    0.936 (0.008)
Specificity        0.973 (0.003)    0.955 (0.005)    0.970 (0.003)    0.976 (0.005)
TABLE 7.3
Results for the FM generating model.

N = 1000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    159.19           55.07            43.91            43.40
$G^2$              100.70           71.37            75.01            80.61
Log-likelihood     -1517.72         -1503.05         -1504.87         -1513.80
Sensitivity        0.928 (0.027)    0.898 (0.053)    0.907 (0.034)    0.915 (0.021)
Specificity        0.949 (0.007)    0.953 (0.008)    0.950 (0.008)    0.954 (0.006)

N = 4000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    109.61           44.53            36.38            93.08
$G^2$              91.77            53.30            43.89            108.38
Log-likelihood     -5287.40         -5268.16         -5263.46         -5292.61
Sensitivity        0.838 (0.019)    0.841 (0.023)    0.842 (0.020)    0.864 (0.011)
Specificity        0.969 (0.003)    0.968 (0.003)    0.968 (0.003)    0.969 (0.002)

TABLE 7.4
Results for the extended mixture GoM generating model.

N = 1000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    174.34           48.77            30.94            35.70
$G^2$              138.55           80.85            66.67            62.61
Log-likelihood     -887.45          -858.59          -851.50          -854.46
Sensitivity        0.791 (0.033)    0.843 (0.050)    0.857 (0.067)    0.835 (0.023)
Specificity        0.998 (0.002)    0.998 (0.004)    0.974 (0.014)    0.979 (0.002)

N = 4000
Criterion          LCM              LCRE             FM               ExtM-GoM
SSPR $X^2_{tr}$    394.06           193.19           42.32            41.57
$G^2$              328.31           270.28           53.14            51.48
Log-likelihood     -3463.84         -3443.51         -3326.26         -3330.02
Sensitivity        0.772 (0.018)    0.874 (0.019)    0.789 (0.024)    0.858 (0.006)
Specificity        0.999 (0.001)    0.969 (0.003)    0.991 (0.005)    0.962 (0.002)
A first remark concerning the simulation results is that the fit criteria do not always favor the
generating model. For example, perhaps not surprisingly, the truncated sum of squared Pearson
residuals tends to be better for the finite mixture and the extended mixture GoM models. These
models provide perfect fit for the two largest observed counts, even when they are not the true
data-generating models. In general, however, we observe that different latent structure models can
produce similar fit; this finding confirms that it can be difficult to distinguish between models with
different dependence structures (Albert and Dodd, 2004).

Examining bias for the sensitivity and specificity estimates, we observe that when the latent class 
model is used to generate the data, all of the models successfully recover the accuracy parameters 
for both cases, N = 1000 and N = 4000. 

The recovery of the accuracy parameters is less successful when more complex models
are used to generate the data. When the latent class random effects model generates the data,
even the true model recovers only the specificity parameter, but not the sensitivity parameter. For
the latent class random effects as the generating model, the extended mixture GoM does best in
recovering the sensitivity for the smaller sample size, but the finite mixture model does best in
recovering the sensitivity for the larger sample size. When the finite mixture model is the generating
model, all models perform well in recovering the sensitivity and specificity for the $N = 1000$ case;
however, all models perform poorly for the $N = 4000$ case. In the latter scenario, we see that
all the models considered, including the true model, underestimate the sensitivity and overestimate
the specificity parameters. When the extended mixture GoM model is the generating model, we
observe that the latent class random effects model does better at recovering the true value of the
sensitivity parameter, even though the fit of this model is not as good as that of the others. We also
observe that all models tend to overestimate the specificity parameter for data generated with the
extended mixture GoM model.

Finally, we observe that in the most difficult cases, the sensitivity parameter was underestimated
to varying degrees by all of the models, and the specificity was overestimated by all of the models. In
such cases, models that provide larger sensitivity estimates and smaller specificity estimates could
be considered especially informative in a sensitivity analysis with respect to latent structure
assumptions. We also note that the sensitivity and especially specificity estimates under the extended
mixture GoM model had smaller standard errors than those of the other models,
independently of the true model.


7.5 Analysis of Chlamydia trachomatis Data 

In this section, we provide sensitivity analysis for a dataset on testing for Chlamydia trachomatis, 
originally considered by Hadgu and Qu (1998). Chlamydia trachomatis (CT) is the most common 
sexually transmitted bacterial infection in the U.S. The data contain binary outcomes on J = 6 tests 
for 4,583 women where positive responses correspond to detection of the disease. The six tests are 
Syva-DFA, Syva-EIA, Abbott-EIA, GenProbe, Sanofi-EIA, and a culture test (for more information, 
see Hadgu and Qu, 1998). Table 7.5 provides response patterns with positive observed counts. In 
our analyses, we follow Hadgu and Qu (1998), who only retained complete response patterns and 
assumed that patterns with zero observed counts were sampling and not structural zeros. 

We assumed two basis latent classes and analyzed the CT data with the latent class, the latent 
class random effects, the finite mixture, and the extended mixture GoM models. For comparative 
purposes, we also provide results obtained with the GoM model, although we did not expect it to 
perform well based on our earlier experience with disability data (Erosheva et al., 2007). 

To obtain draws from the joint posterior distribution, we used the same prior and proposal
distributions for the standard GoM and the extended mixture GoM models. We chose a Gamma
prior distribution on $\alpha_0$, with the shape and the inverse scale parameters both equal to 1, and chose
the uniform prior for the $\xi$ parameters. We chose the shape parameter for the proposal distribution on
$\alpha_0$ to be $C_1 = 100$, and the sum of the parameters of the proposal distribution for $\xi$ to be
$C_2 = 1$. We set starting values for $\lambda$ to the estimated conditional response probabilities
from the latent class model with two classes, and selected the starting value for hyperparameters $\alpha$
to be $(0.001, 0.099)$.

We monitored the convergence of MCMC chains using Geweke convergence diagnostics
(Geweke, 1992), and Heidelberger and Welch stationarity and interval halfwidth tests
(Heidelberger and Welch, 1983). Furthermore, we visually examined plots of successive iterations. With
our choices of starting values and parameters for prior and proposal distributions, all these methods
indicated favorable convergence of MCMC chains for both the standard and the extended mixture
GoM models.

Under the extended mixture GoM model, the estimated proportion $\theta_0$ of women belonging to the
deterministic compartment of a clear negative diagnosis was $\hat{\theta}_0 = 0.917$ (sd = 0.009). The observed
proportion of all-zero responses was 0.944. The estimated proportion $\theta_1$ of women belonging to
the deterministic compartment of a clear positive diagnosis was $\hat{\theta}_1 = 0.017$ (sd = 0.009), while the
observed proportion of all-one responses was 0.019. Thus, consistent with the idea of screening
tests in medicine, about 97% of individuals with negative results on all six tests were healthy with
probability 1, while about 89% of individuals with positive results on all six tests were diseased with
probability 1. Table 7.6 shows the posterior means and standard deviation estimates of parameters
$\lambda_{kj}$, $\xi$, $\alpha_0$ for the GoM portion of the extended mixture GoM model. We observe that the extreme
profile $k = 1$ represents women with likely positive CT diagnosis, while extreme profile $k = 2$
represents women with likely negative CT diagnosis.
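
The two percentages quoted for the deterministic compartments are the ratios of the estimated compartment weights to the observed proportions of the corresponding response patterns. A worked check using the rounded values reported in the text:

```latex
\begin{align*}
\frac{\hat{\theta}_0}{\text{observed all-zero proportion}} &= \frac{0.917}{0.944} \approx 0.971, \\
\frac{\hat{\theta}_1}{\text{observed all-one proportion}} &= \frac{0.017}{0.019} \approx 0.895.
\end{align*}
```

These ratios correspond to the figures of about 97% and about 89% given above.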

Table 7.5 shows the observed and expected cell counts as well as the number of parameters, the
degrees of freedom, and the values of the likelihood ratio statistic $G^2$ (see Bishop et al., 1975) for
all five models considered. As expected, we observe that the fits of the standard GoM model and
the latent class model are rather poor. The extended mixture GoM model provides a fit comparable
to that of the latent class random effects model of Hadgu and Qu (1998) and
the finite mixture model of Albert and Dodd (2004); however, it has one fewer degree of freedom.

Table 7.7 provides the sensitivity and specificity estimates for the six tests under the five models
considered. We see that the sensitivity and specificity estimates are rather high. Given our findings
from the simulation study that pointed towards widespread overestimation of specificity and
underestimation of sensitivity, we should pay particular attention to the smallest specificity and largest
sensitivity estimates. For specificity, the extended mixture GoM model and the finite mixture model
provide the smallest values that range from about 0.993 for the culture test to about 0.997 for the
Syva-DFA test. For sensitivity, the extended mixture GoM model provides the largest values ranging
from about 0.757 for the Abbott-EIA test to about 0.985 for the culture test. Thus, it appears that the
extended mixture GoM model contributes new information beyond that available from other latent
structure models, for estimating accuracy parameters of diagnostic tests.
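
As a rough consistency check, the sensitivity formula of Section 7.3.3 can be evaluated at the posterior means reported for the extended mixture GoM model. This plug-in calculation is only an approximation, since the posterior mean of a ratio need not equal the ratio evaluated at posterior means. It also assumes that the first item in Table 7.6 corresponds to the Syva-DFA test, following the order in which the tests are listed.

```latex
\begin{align*}
\hat{\theta}_2 &= 1 - 0.917 - 0.017 = 0.066, \qquad
\hat{\theta}_2 \hat{\xi}_1 = 0.066 \times 0.278 = 0.018348, \\
P(x_1 = 1 \mid \delta = 1) &\approx
\frac{\hat{\theta}_1 + \hat{\lambda}_{1,1}\, \hat{\theta}_2 \hat{\xi}_1}{\hat{\theta}_1 + \hat{\theta}_2 \hat{\xi}_1}
= \frac{0.017 + 0.750 \times 0.018348}{0.017 + 0.018348}
= \frac{0.030761}{0.035348} \approx 0.870.
\end{align*}
```

The result is in line with the Syva-DFA sensitivity of 0.870 reported for the extended mixture GoM model in Table 7.7.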
TABLE 7.5
Observed and expected cell counts under LCM, LCRE, GoM, ExtM-GoM, and FM models. $X^2$ is
the truncated sum of squared Pearson residuals for observed cell counts $\geq 1$; $n$ is the number of
independent parameters; df is degrees of freedom.

       Response   Observed   Expected   Expected   Expected   Expected   Expected
       pattern    counts     LCM        LCRE       GoM        ExtM-GoM   FM
1      111111     87         53.4       87.30      437.31     87         87.18
2      111110     2          0.85       0.12       11.53      0.50       0.40
3      111101     9          17.26      8.19       146.20     7.64       7.07
4      111011     2          8.47       2.93       73.43      2.60       2.64
5      110111     9          21.2       11.79      167.21     10.27      9.52
6      110101     6          6.85       5.03       56.78      6.98       6.68
7      110011     1          3.36       1.98       29.64      2.42       2.5
8      110001     6          1.09       2.09       10.27      1.91       1.85
9      101111     7          11.57      4.91       91.33      4.44       4.05
10     101001     1          0.59       0.93       6.93       0.97       0.87
11     100111     4          4.59       3.15       36.12      4.08       3.84
12     100101     1          1.48       3.26       13.57      3.08       2.80
13     100011     2          0.73       1.32       8.89       1.20       1.11
14     100001     5          0.27       3.15       4.17       2.09       1.87
15     100000     7          7.50       6.92       6.27       6.91       7.23
16     011111     6          10.58      4.14       93.3       4.02       3.72
17     011101     1          3.42       1.93       33.37      2.80       2.62
18     011011     1          1.68       0.76       15.54      1.00       1.00
19     011001     1          0.54       0.81       5.88       0.97       0.87
20     011000     5          0.07       4.96       1.13       1.26       1.22
21     010111     1          4.20       2.70       36.34      3.72       3.53
22     010101     5          1.36       2.81       13.30      2.89       2.64
23     010001     2          0.28       2.71       3.93       2.59       2.41
24     010000     9          13.56      8.92       7.29       10.34      10.91
25     001111     2          2.29       1.21       20.76      1.65       1.51
26     001101     1          0.74       1.25       9.46       1.47       1.28
27     001001     1          0.20       1.18       2.78       2.54       2.49
28     001000     14         18.46      13.88      10.24      12.72      13.64
29     000111     2          0.91       1.77       9.53       1.82       1.61
30     000101     4          0.37       4.25       5.12       3.39       3.12
31     000100     16         16.54      15.82      11.46      12.45      13.15
32     000011     3          0.22       1.70       3.33       2.43       2.31
33     000010     15         15.73      14.85      9.84       11.40      11.98
34     000001     17         20.02      17.04      11.45      19.50      20.50
35     000000     4328       4321.11    4328.30    3123.48    4328       4322.71
$X^2$                        604.95     48.61      1529.89    43.29      48.70
$G^2$                        179.68     42.25      1141.70    61.44      76.24
$n$                          13         15         14         16         15
df                           50         48         49         47         48
TABLE 7.6
Posterior mean (standard deviation) estimates for the diagnostic error component of the extended
mixture GoM model with two extreme profiles (stochastic compartment).

                   k = 1            k = 2
$\lambda_{k,1}$    0.750 (0.058)    0.050 (0.022)
$\lambda_{k,2}$    0.732 (0.058)    0.076 (0.027)
$\lambda_{k,3}$    0.534 (0.065)    0.093 (0.027)
$\lambda_{k,4}$    0.822 (0.057)    0.089 (0.029)
$\lambda_{k,5}$    0.608 (0.068)    0.084 (0.026)
$\lambda_{k,6}$    0.970 (0.021)    0.134 (0.042)
$\alpha_0$         0.075 (0.057)
$\xi_k$            0.278 (0.044)    0.722 (0.044)

TABLE 7.7
Sensitivity and specificity (standard deviation) of the diagnostic tests for LCM, LCRE, GoM,
ExtM-GoM, and FM models.

Test          LCM              LCRE             GoM              ExtM-GoM         FM

Specificity
Syva-DFA      0.998 (0.001)    0.999 (0.001)    0.999 (0.001)    0.998 (0.001)    0.997 (0.001)
Syva-EIA      0.997 (0.001)    0.997 (0.001)    0.999 (0.001)    0.996 (0.001)    0.996 (0.001)
Abbott-EIA    0.996 (0.001)    0.997 (0.001)    0.998 (0.001)    0.995 (0.001)    0.995 (0.001)
GenProbe      0.996 (0.001)    0.997 (0.001)    0.998 (0.001)    0.996 (0.001)    0.995 (0.001)
Sanofi-EIA    0.996 (0.001)    0.997 (0.001)    0.997 (0.001)    0.996 (0.001)    0.996 (0.001)
Culture       0.995 (0.001)    0.998 (0.001)    0.998 (0.001)    0.994 (0.001)    0.993 (0.001)

Sensitivity
Syva-DFA      0.835 (0.030)    0.747 (0.048)    0.825 (0.034)    0.870 (0.030)    0.860 (0.030)
Syva-EIA      0.822 (0.031)    0.731 (0.048)    0.824 (0.034)    0.861 (0.030)    0.851 (0.032)
Abbott-EIA    0.716 (0.036)    0.636 (0.047)    0.721 (0.037)    0.757 (0.035)    0.748 (0.037)
GenProbe      0.863 (0.028)    0.777 (0.047)    0.864 (0.031)    0.907 (0.030)    0.892 (0.029)
Sanofi-EIA    0.756 (0.034)    0.678 (0.048)    0.751 (0.037)    0.796 (0.036)    0.786 (0.036)
Culture       0.984 (0.011)    0.935 (0.034)    0.978 (0.013)    0.985 (0.010)    0.980 (0.011)
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7.6 Conclusion 

We presented the extended mixture GoM model as an alternative latent structure model for analyzing 
accuracy of diagnostic and screening tests. For the medical testing case, the extended mixture GoM 
model accommodates two types of individuals, those whose diagnosis is certain, independently of 
the test, and those who are subject to diagnostic error. The latter individuals can be thought of as 
stochastic “movers” because they may change in their disease status depending on the diagnostic test 
considered; this idea is analogous to longitudinal mover-stayer models (Blumen et al., 1955). In the 
extended mixture GoM model, the “stayers” have predetermined response patterns that correspond 
to particular cells in a contingency table. The extended GoM mixture model can also be seen as a 
combination of latent class and GoM mixture modeling, analogous to the extended finite mixture 
model (Muthen and Shedden, 1999). Similar to the finite mixture model of Albert and Dodd (2004),
the extended GoM mixture model allows for some individuals to always be diagnosed correctly;
however, it relies on the assumption of disease severity for modeling diagnostic error for the other
individuals, while the finite mixture model of Erosheva (2005) relies on the assumption of two latent
classes for the same purpose.

To estimate the extended mixture GoM model, we used a hierarchical Bayesian approach,
drawing on earlier work (Erosheva, 2003). Although we did not use informative priors
in our examples, Pfeiffer and Castle (2005) and Albert and Dodd (2004) specifically mention the
promise of a Bayesian approach in estimating diagnostic error without a gold standard when good
prior information is available.

Our findings with the simulation study and with the Chlamydia trachomatis (Hadgu and Qu,
1998) data further emphasize the need to carry out sensitivity analyses when no gold standard is
available (Albert and Dodd, 2004). When the underlying latent structure is unknown, it is important
to examine the sensitivity of scientific conclusions regarding the estimated accuracy of diagnostic and
screening tests to the latent structure assumptions.

This could be done by comparing test accuracy estimates across a number of different latent 
structure models. Our results demonstrate that the extended mixture GoM model can provide us 
with new information on estimating test sensitivity and specificity beyond that provided by the 
latent class (Lazarsfeld and Henry, 1968; Goodman, 1974), latent class Gaussian random effects (Qu 
et al., 1996), and finite mixture models (Albert and Dodd, 2004). In addition, the extended mixture 
GoM model offers a plausible interpretation for diagnostic and screening test results by combining 
the idea of diagnostic error that depends on disease severity for some individuals with the idea of 
certain diagnosis for the other, typically the most healthy and the most sick individuals. 

Finally, the flexible framework of mixed membership models can be used to modify the extended 
mixture GoM model and to address, for example, diagnostic cases when disease status is ordinal and 
test results do not come in binary form (Wang and Zhou, 2012). Drawing on the recent development 
in the class of mixed membership models that includes the GoM model as a special case, one can 
also modify the model to accommodate, for example, outcomes of mixed types, multiple basis 
categories in the latent structure, and correlations among membership scores.

Although shared membership of individuals in two or more categories of a classification scheme is
a distinguishing feature of the family of mixed membership models, relatively few analyses using 
these models pay much attention to this special feature. Most published analyses to-date focus on 
identifying and interpreting the extreme, or ideal, types consistent with a given body of data, thereby 
in effect using mixed membership models as crisp clustering techniques. Getting into the domain 
of shared membership quickly places the investigator in a difficult position, as standard estimation 
strategies produce a large number of ideal profiles, almost always greater than six, that represent 
best fitting representations of the data, while at the same time making it impossible to interpret what 
membership in, say, four or more profiles actually means. This conflict between statistical goodness- 
of-fit and subject-matter-based interpretability of shared membership cannot usually be resolved 
using conventional mixed membership models. We show that by introducing separate mixed mem- 
bership models, each containing a small number of ideal profiles, to describe a population according 
to responses focused on distinct subject matter domains, and at the same time producing a vector 
of correlated grade of membership scores for the individuals, interpretation of shared memberships 
across the distinct subject matter domains becomes feasible. Deciding on what constitutes a good 
model requires tradeoffs between statistical goodness-of-fit criteria and frequently non-quantifiable 
subject-matter-based interpretation. We illustrate these unavoidable tradeoffs in several epidemio- 
logical contexts. 


8.1 Introduction 

Mixed membership models are ideally suited for characterizing heterogeneous populations where
many individuals have multiple characteristics of interest, no combination of which occurs at high
frequency in an overall population. Examples of this phenomenon are: (i) representation of the health
status of elderly populations at the community level (Berkman et al., 1989); (ii) prevalence of,
and variation in, co-infection with tropical diseases at the district level (Keiser et al., 2002; Raso
et al., 2006); (iii) characterization of environmental and behavioral risks for Chagas disease in Latin
America (Chuit et al., 2001); (iv) representation of malaria risk in complex eco-systems (e.g., the
Brazilian Amazon) (Castro et al., 2006); and (v) representation of variation in disability among the
elderly in the United States (Manton et al., 2006). Although mixed membership models have been
fit to data from the above and other settings, very little attention has been given to the severe lack of
interpretability of ‘shared membership’ of characteristics of individuals among a family of extreme,
or ideal, types. Indeed, most published analyses to date using this class of models focus on
interpretation of ideal type profiles, and neglect the more complex stories inherent in the phenomenon
of shared membership. This is tantamount to using a mixed membership model as a crisp
clustering methodology and bypassing the analysis of shared membership, which is the key distinguishing
feature of the technology.

Standard estimation strategies for mixed membership models, if used in an exploratory mode,
produce a number $K$ of ideal profiles that lead to the best fitting and parsimonious representations of
the multidimensional data. The optimum value of $K$ can be defined based on a measure of relative
goodness-of-fit of models run for several $K$ values (Manton et al., 1994; Airoldi et al., 2010; White
et al., 2012; Suleman, 2013). However, interpreting these profiles can be rather complex. Assuming
$K$ ideal profiles, each individual is assigned a grade of membership (GoM) vector $g$ that corresponds
to coordinates in a simplex with $K$ vertices, each of which represents an ideal, or extreme, type of
individual. For example, in the case of two profiles ($K = 2$), non-zero entries in $g$ define the
individuals that belong to the $k_1$ profile, while those equal to zero determine who does not belong
to the same $k_1$ profile, but to $k_2$ instead. These cases can be understood as vertices (endpoints) of
a line. However, some elements do not lie on the vertices but on the line connecting them. They
share characteristics of both profiles. If, for example, the vertices correspond to ‘very high risk’ and
‘very low risk’, respectively, then $g = (g_1, 1 - g_1)$ with $0 < g_1 < 1$ identifies an individual with a
degree of risk that is intermediate between the extremes corresponding to the vertices. In the case
of three profiles ($K = 3$), however, the geometrical representation moves from a line to a triangle,
and individuals with shared membership will lie on either an edge of the triangle or in the interior.
Individuals on edges share conditions with only two vertices, and degree of similarity to one or the
other of them provides for straightforward interpretation. However, individuals in the interior of the
triangle share conditions with all three vertices, and writing a coherent English sentence describing
the shared conditions becomes a more substantial challenge.

More formally, the number of non-zero entries in $g$ is the number of vertices with one or more
associated conditions that represent the state of a given individual. For a given $K$, there are $\binom{K}{2}$
edges of the simplex that represent shared conditions among two vertices. There are $\binom{K}{3}$ faces
of the simplex that represent shared conditions among three vertices, and a $g$-vector with all
non-zero entries corresponds to an individual who shares conditions with all $K$ vertices. The central
problem associated with values of $K \ge 4$ is writing an interpretable description of what it means
to share conditions with four or more ideal types. This, of course, is not a statistical problem, but
it presents a challenge to investigators that, to the best of our knowledge, has received almost no
attention in the extant literature on mixed membership models.
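
To give a feel for how quickly the number of edges and faces grows with K, the counts can be tabulated with binomial coefficients. The following Python sketch is only an illustration and is not part of the original analysis; the function name `simplex_counts` is hypothetical.

```python
from math import comb


def simplex_counts(K):
    """Return (vertices, edges, triangular faces) of a simplex with K vertices.

    Edges correspond to shared conditions among two pure types,
    triangular faces to shared conditions among three pure types.
    """
    return K, comb(K, 2), comb(K, 3)


for K in (2, 3, 4, 6):
    vertices, edges, faces = simplex_counts(K)
    print(f"K={K}: vertices={vertices}, edges={edges}, faces={faces}")
```

With K = 6, for instance, there are already 15 edges and 20 triangular faces, each of which would in principle need its own verbal description.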

This paper addresses this issue. Our purpose is to present examples of problems where mixed 
membership modeling with non-standard model specifications and/or vertex-edge-face aggregation 
schemes facilitates interpretability of shared conditions. Section 8.2 contains specifications of the 
mixed membership model used in Sections 8.3 and 8.4. In particular, a non-standard specification 
is introduced where the original response vector is partitioned into blocks of variables associated 
with distinct subject-matter domains and a mixed membership model is estimated for each domain, 
thereby generating a set of GoM vectors for each individual, one vector for each domain in the 
partitioning. This facilitates interpretability in characterizations of disease risk, as we indicate in
the next section. In Section 8.3 we discuss an example of characterization of community risk of
Chagas disease in rural Argentina, where the original response vector is partitioned into compo- 
nents focused on indices of blood availability and environmental characteristics. Thus, GoM score 
vectors are generated for each of these domains and interpreted in the context of overall community 
risk of transmission of Chagas disease. In Section 8.4, we illustrate a standard mixed membership 
model with 6 pure types, and introduce a vertex-edge aggregation strategy in a study of changes in 
disability over time in the U.S. population during the period 1982-1994. The aggregation scheme 
facilitates interpretation of shared membership in a setting where model complexity could, in prin- 
ciple, impair our ability to describe subtleties in the data. We conclude with a brief discussion of 
open methodological problems that are an outgrowth of our examples. 


8.2 Mixed Membership Model Specifications 

Let $X = (X_1, \ldots, X_J)$ be a vector whose components are variables which can each assume a
finite number of possible values. We consider the data analytic task of mapping response vectors
$\{X^{(i)}, 1 \le i \le N\}$ for $N$ individuals into a unit simplex with $K$ vertices (to be estimated) and
a GoM vector, $g^{(i)} = (g_1, \ldots, g_K)^{(i)}$ with $g_1 + \cdots + g_K = 1$ for each individual, specifying
a location in the unit simplex. Each vertex of the simplex is associated with a set of levels on a
subset of the variables in $X$. Each set of such levels is interpreted as an ideal, or pure type, set of
characteristics. GoM vectors that have all components equal to zero except one of them, say the
$k$th (which must be a 1), identify individuals with response vectors having all of the conditions
in the $k$th pure type. GoM vectors with two or more non-zero entries identify individuals whose
response vectors share conditions with the pure types corresponding to the non-zero entries. They
exhibit mixed membership across the $K$ pure types, and provide the rationale for the terminology,
‘mixed membership models.’

In the conventional version of mixed membership models (Manton et al., 1994; Erosheva et al.,
2007), we assume that the variables in the response vector $X$ are independent, conditional on the
GoM score vector $g$. More formally, we let $X^{(g)}$ denote the response vector for an individual with
GoM score vector $g$. Then we introduce the probability model (Singer, 1989):

\[
\Pr(X^{(g)} = l) = \int_{S_K} \Pr(X^{(g)} = l \mid g = \gamma)\, d\mu(\gamma)
= \int_{S_K} \prod_{j=1}^{J} \Pr(X_j^{(g)} = l_j \mid g = \gamma)\, d\mu(\gamma), \tag{8.1}
\]

where $\mu(\gamma)$ is the distribution of GoM scores and
$S_K = \{\gamma = (\gamma_1, \ldots, \gamma_K) : \gamma_k \ge 0, \sum_k \gamma_k = 1\}$
is the unit simplex with $K$ vertices. The conditional probabilities in Equation (8.1) can be written
as

\[
\Pr(X_j^{(g)} = l_j \mid g = \gamma) = \sum_{k=1}^{K} \gamma_k \lambda_{k,j,l_j},
\]

where $\lambda_{k,j,l_j}$ (called pure type probabilities) can be defined as the probability in profile $k$ of
observing level $l_j$ on variable $X_j$. They are subject to the following constraints (Woodbury and Manton,
1982):

\[
\sum_{l \in L_j} \lambda_{k,j,l} = 1 \quad \text{for } 1 \le k \le K \text{ and } 1 \le j \le J,
\]

and $L_j$ is the set of possible levels of variable $X_j$. Values of these probabilities close to 0 or 1
imply that a distinguished level is either almost certain to appear, or almost never to appear, in the
$k$th pure type. The estimation problem for specification (8.1) is to find $K$, the associated pure type
probabilities, and the individual GoM score vector $g$ such that the model provides a good numerical
fit to the data and is also interpretable. The words ‘good fit’ are associated with numerical goodness-
of-fit criteria that are context free. However, the word ‘interpretable’ is not connected to statistical
methodology, but is directly linked to our ability to describe the position of an individual in the 
simplex, i.e., describe the GoM score vector g in coherent English sentences that make sense in 
the scientific setting of a given dataset. The fundamental problem under discussion in this chapter 
is numerical identification of values of K which can be anywhere from 4 to 60 or 70, depending 
on the dataset, and the inability of the investigator to write a coherent paragraph describing GoM 
vectors with four or more non-zero entries and the sharing of membership among many pure types. 
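
The conditional part of specification (8.1) is simple to compute once the GoM scores and pure type probabilities are given. The sketch below is a minimal illustration in plain Python, not the estimation procedure used in this chapter: it evaluates the probability of a response vector conditional on g and does not integrate over the distribution of GoM scores. The function names and the numerical values are hypothetical.

```python
def level_probs(g, lam_j):
    """Pr(X_j = l | g) = sum_k g_k * lambda_{k,j,l}, for every level l of variable j.

    g      -- GoM vector of length K (non-negative, sums to 1)
    lam_j  -- lam_j[k][l] is the pure type probability of level l in profile k
    """
    n_levels = len(lam_j[0])
    return [sum(g_k * lam_j[k][l] for k, g_k in enumerate(g)) for l in range(n_levels)]


def response_prob(g, lam, x):
    """Pr(X = x | g): product over variables, by conditional independence given g."""
    p = 1.0
    for j, level in enumerate(x):
        p *= level_probs(g, lam[j])[level]
    return p


# Hypothetical values: K = 2 profiles, J = 2 binary variables.
lam = [
    [[0.9, 0.1], [0.2, 0.8]],  # variable 1: rows are profiles, columns are levels
    [[0.7, 0.3], [0.1, 0.9]],  # variable 2
]
g = [0.25, 0.75]
print(level_probs(g, lam[0]))
print(response_prob(g, lam, [0, 1]))
```

Each row of `lam[j]` sums to 1, which is the constraint on the pure type probabilities stated above; a GoM vector at a vertex, such as `[1.0, 0.0]`, simply reproduces the corresponding pure type probabilities.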

Depending on the dataset, and with a large value of K associated with the best fitting model, 
we may find that nearly all individuals have GoM vectors that place them on a vertex or on an edge 
of the simplex, sharing conditions with at most 2 pure types. This is a relatively easy situation in 
which to provide interpretable descriptions of response vectors sharing conditions between the pure 
types. If 90% or more of the individuals share conditions with at most 3 pure types — i.e., they are, 
at worst, situated in a face of the simplex — interpretable summaries of the shared conditions can 
also frequently be produced. In a study of classification of scientific papers (Airoldi et al., 2010), 
mixed membership models with K = 20 were utilized on the basis of goodness-of-fit criteria, 
but nearly all papers in the classification exercise shared conditions with at most 5 pure types. 
Most of the papers with shared pure types involved only 2 or 3 such profiles. In the context of 
that study, even sharing of 3 or 4 pure types resulted in interpretable formulations of the shared 
conditions. However, GoM vectors with 5 non-zero entries seem to be close to an upper bound on 
shared condition interpretability. In Section 8.4, we will show an example with K = 6, but where a 
subject-matter-driven aggregation of edges in the unit simplex leads to interpretability of the set of 
all shared conditions among pairs of vertices. 
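
Whether a fitted model falls into one of the easier situations described above can be checked by tabulating, for the estimated GoM vectors, how many pure types each individual shares conditions with. The following Python sketch illustrates that tabulation; the function names, the tolerance used to treat a score as zero, and the example vectors are assumptions made for illustration only.

```python
from collections import Counter


def support_sizes(gom_vectors, tol=1e-6):
    """Count individuals by the number of non-zero entries in their GoM vector."""
    return Counter(sum(1 for score in g if score > tol) for g in gom_vectors)


def share_with_at_most(gom_vectors, m, tol=1e-6):
    """Fraction of individuals sharing conditions with at most m pure types."""
    counts = support_sizes(gom_vectors, tol)
    total = sum(counts.values())
    return sum(c for size, c in counts.items() if size <= m) / total


# Hypothetical GoM vectors for four individuals with K = 4.
vectors = [
    [1.0, 0.0, 0.0, 0.0],      # vertex
    [0.5, 0.5, 0.0, 0.0],      # edge
    [0.2, 0.3, 0.5, 0.0],      # face
    [0.25, 0.25, 0.25, 0.25],  # interior
]
print(support_sizes(vectors))
print(share_with_at_most(vectors, 3))
```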

In studies of disease risk, as illustrated in Section 8.3, the variables in X can frequently be par- 
titioned into subsets, each of which is associated with a different domain of risk. This context also 
makes it desirable to use 2 pure type specifications corresponding to the extremes of high and low 
risk for each domain. What we frequently find with high-dimensional X is that 2 pure type models 
provide a poor fit to the data, but models with 3, 4, and 5 pure types do much better numerically, 
while paying a high price in losing gradation of risk interpretations of shared conditions. One route 
out of this dilemma is to change the conditioning structure of the mixed membership model and 
produce two or more sets of 2 pure type representations, one for each substantively defined risk 
domain. Then for the variables associated with each domain, we have high and low risk interpre- 
tations of 2 pure types and a separate GoM score vector for each individual, one vector for each 
domain. Essentially we are trading higher dimensionality in $K$, with a single GoM vector, for lower
dimensionality in $K$ (namely, $K = 2$), but with a set of correlated GoM vectors for each individual,
one vector for each risk domain.

We describe a mixed membership model structure for a multiple domain specification in the
simplest setting (two domains). Let $X = (X^{(1)}, X^{(2)})$ be a response vector with subsets of variables
$X^{(i)}$ associated with the subject matter domains indexed by $i = 1, 2$. We introduce the pair of GoM
vectors $g^{(1)}$ and $g^{(2)}$ and assume that the variables in each of $X^{(i)}$ for $i = 1, 2$ are independent,
conditional on $g = (g^{(1)}, g^{(2)})$. Then we set

\[
\begin{aligned}
\Pr(X = l \mid g = \gamma)
&= \prod_{j \in (1)} \Pr(X_j^{(1)} = l_j \mid g = \gamma) \prod_{j \in (2)} \Pr(X_j^{(2)} = l_j \mid g = \gamma) \\
&= \prod_{j \in (1)} \sum_{k \in K(1)} \gamma_k^{(1)} \lambda_{k,j,l_j}^{(1)} \prod_{j \in (2)} \sum_{k \in K(2)} \gamma_k^{(2)} \lambda_{k,j,l_j}^{(2)}.
\end{aligned}
\tag{8.2}
\]

Here, $g = (g^{(1)}, g^{(2)})$ and $K(j)$ is the number of pure types in group $(j)$, $j = 1, 2$. In the context of the
risk profiles mentioned above, we would impose $K(1) = K(2) = 2$. In Section 8.3, we present an
analysis of Chagas disease risk where the representation (8.2) plays a central role.
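
The factorization across domains can be sketched in a few lines of Python. This is an illustrative reading of the two-domain specification, conditional on the pair of GoM vectors, and not the authors' estimation code; the function names and the numerical values are hypothetical.

```python
def domain_prob(g, lam, x):
    """Pr(X^(i) = x | g^(i)) for one domain: product over its variables of
    sum_k g_k * lambda_{k,j,l_j}.

    lam[j][k][l] is the pure type probability of level l on variable j in profile k.
    """
    p = 1.0
    for j, level in enumerate(x):
        p *= sum(g_k * lam[j][k][level] for k, g_k in enumerate(g))
    return p


def two_domain_prob(g1, lam1, x1, g2, lam2, x2):
    """Product of the two domain-specific terms, each with its own GoM vector."""
    return domain_prob(g1, lam1, x1) * domain_prob(g2, lam2, x2)


# Hypothetical values: one binary variable per domain, two pure types per domain.
lam1 = [[[0.8, 0.2], [0.1, 0.9]]]
lam2 = [[[0.6, 0.4], [0.3, 0.7]]]
print(two_domain_prob([0.5, 0.5], lam1, [0], [1.0, 0.0], lam2, [1]))
```

Each domain keeps a two-vertex simplex, so every score retains a gradation-of-risk reading, while the dependence between domains is carried by the joint behavior of the two GoM vectors.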
8.3 Chagas Disease Risk in Rural Argentina

Rural communities that are endemic for Chagas disease usually consist of privately owned habi- 
tats containing a primary house and peridomestic structures that serve as animal pens/housing, crop 
storage areas, and tool sheds and storage areas for agricultural equipment. The physical characteris- 
tics of houses and peridomestic structures are highly variable, thereby creating great heterogeneity 
within a community in sources of blood meals for triatomine bugs. Dogs, cats, chickens, and young 
children are all part of the transmission system, and their physical proximity during the night is an 
important factor in attracting Triatoma infestans vectors, the primary transmitters of Trypanosoma 
cruzi parasites that are the causative agents of Chagas disease. 

We consider a community with 445 habitats in rural Santiago del Estero, Argentina (Paulone
et al., 1991), where at baseline, 99.6% (443/445) of the habitats were infested with T. infestans bugs. As
part of the initial data collection, T. infestans were collected from 390 (88%) of the houses and from 
the peridomestic structures of 280 (63%) habitats. A total of 6,518 T. infestans were captured in 
the 390 infested bedroom areas. Of these, 2,249 bugs were examined for T. cruzi parasites, and 697 
(31%) were found positive. On the human side of the transmission system, 2,153 (69%) of the 3,194 
persons in the community were serologically tested. The prevalence rate of seropositivity against 
T. cruzi infection was 29.2% (630/2153). Age specific seropositivity ranged from 9.6% in children 
under the age of 5 years to 57.7% in persons aged 70 or more. For the age group of children aged 
5-14 the seropositivity rate was 25.3%. 

Despite the high overall seropositivity rate, there were sets of habitats with very low rates and 
other sets with high rates in children in the age range 5-14. This is a useful age range for assessing 
Chagas disease incidence in the relatively recent past, particularly in a community where there has 
been no active control activity against infestation prior to the study by Paulone et al. (1991). Further, 
it is not at all obvious which habitats are at highest or lowest risk for Chagas disease transmission on 
the basis of a walking tour of the community, or even a verbal estimate from the community health 
officer. Our analytical problem is to identify the highest and lowest risk habitats in the midst of a 
highly heterogeneous endemic community, with the longer term objective of adapting the features 
of the low risk habitats on a wider scale as a hopefully low cost means of preventing transmission 
of T. cruzi. 

To this end, a mixed membership modeling exercise using specification (8.2) was carried out
using variables from two distinct domains: indices of blood availability and characteristics of the
physical environment (Chuit et al., 2001). These are delineated in Table 8.1. For each domain a
2-pure type model (interpreted as levels of high and low risk) was fit to the data from the 445 habitats.
A GoM score vector associated with each domain was generated for each habitat. Then, with only
two profiles for each domain, the GoM vectors $g^{(1)} = (g_1, 1 - g_1)$ and $g^{(2)} = (g_2, 1 - g_2)$ associated
with each habitat have $g_i$ indicating the degree of similarity of the habitat characteristics to the high
risk profile for $i = 1$ (blood availability) and $i = 2$ (environmental characteristics). Using tertile
cut points, for each risk domain (blood availability and environmental characteristics) we define
a habitat to be low risk if $0 \le g_i \le 0.20$; intermediate risk if $0.20 < g_i \le 0.70$; and high risk
if $g_i > 0.70$. Cross classifying the habitats by their scores $g_i$, for $i = 1, 2$, Table 8.2 shows the
breakdown of seropositivity rates for children in the age range 5-14 years.
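
The classification step can be sketched as follows. The cut points 0.20 and 0.70 come from the text; how scores falling exactly on a cut point are assigned is an assumption of this sketch (here they go to the lower category), and the function names and example scores are hypothetical.

```python
from collections import Counter


def risk_level(g_high, low_cut=0.20, high_cut=0.70):
    """Map a high-risk GoM score in [0, 1] to a risk category.

    Assumption: scores equal to a cut point are placed in the lower category.
    """
    if not 0.0 <= g_high <= 1.0:
        raise ValueError("GoM score must lie in [0, 1]")
    if g_high <= low_cut:
        return "Low"
    if g_high <= high_cut:
        return "Medium"
    return "High"


def cross_classify(scores):
    """Count habitats by (blood availability level, environmental level).

    scores -- iterable of (g1, g2) pairs, one pair per habitat
    """
    return Counter((risk_level(g1), risk_level(g2)) for g1, g2 in scores)


# Hypothetical scores for three habitats.
habitats = [(0.10, 0.90), (0.10, 0.80), (0.50, 0.50)]
print(cross_classify(habitats))
```

Seropositivity rates per cell would then be computed from the children living in the habitats assigned to each of the nine cells.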





Handbook of Mixed Membership Models and Its Applications 


TABLE 8.1

Response variables used in the mixed membership model.

Indices of Blood Availability                         Conditions
number of dogs                                        ≤ 2; > 2
number of cats                                        ≤ 2; > 2
number of persons                                     ≤ 2; > 2
persons/room                                          ≤ 2; > 2
people/bed                                            ≤ 2; > 2
people/[structures in the habitat]                    ≤ 2; > 2
persons + dogs + cats                                 ≤ 5; 6-8; > 8
[persons + dogs + cats]/[room in the house]           ≤ 6; > 6
[persons + dogs + cats]/bed                           ≤ 4; > 4
[persons + dogs + cats]/[structure in the habitat]    ≤ 3; > 3

Environmental Variables                               Conditions
seasonal migration                                    No; Yes
condition of interior roof                            Good (cement, zinc, fibrocement); Bad (straw, jarilla, discard)
condition of interior walls                           Good (cement, plaster wall, no cracks); Bad (unplastered mud or brick with cracks)
condition of gallery roof                             Good; Bad
number of rooms                                       ≤ 1; > 1
number of beds                                        ≤ 3; > 3
corn storage area                                     No; Yes
kitchen                                               No; Yes
equipment store room                                  No; Yes
corral                                                No; Yes
pig pen                                               No; Yes
brick pile                                            No; Yes

Source: Chuit et al. (2001).


TABLE 8.2

Seropositivity rates (%) by habitat risk for children aged 5-14.

                                   Risk Domain: Environmental Characteristics
Risk level (blood availability)    Low     Medium    High
Low                                18.8    16.2      21.4
Medium                             20.5    22.0      9.0
High                               19.3    22.8      25.0

Estimation in the GoM model (8.2) was carried out by estimating the GoM score vectors and
pure type probabilities for each domain (environmental and blood availability) separately via GoM
model (8.1) with $K = 2$ in each case. The two sets of GoM scores, $g^{(1)}$ and $g^{(2)}$, with a pair of such
scores for each habitat, are, not surprisingly, correlated. Determination of the association between
GoM scores begins with a scatter plot of $(g_1, g_2)$ for the full set of habitats. Division of GoM scores
into tertiles for purposes of classifying habitats by gradations of risk was the result of judgment by
the investigators that such coarse graining led to qualitatively different categories of risk in both the
environmental and blood availability domains. Quartile, quintile, or finer divisions could have been
used, but these do not lead to meaningfully distinct risk categories. We emphasize here that this is
not a statistically driven categorization. It is based on subject matter interpretations of meaningfully
different levels of risk.

Turning to the profiles per se, the high risk profile for blood availability is represented by the 
logical AND statement: [more than 2 dogs] AND [6 persons or dogs or cats per room] AND [3 per- 
sons or dogs or cats per structure at the habitat]. Low risk is characterized by [None of the adverse 
conditions in the blood availability section of Table 8.1]. The high risk profile for physical environ- 
mental characteristics is given by: [poor interior roof] AND [poor gallery roof] AND [presence of a 
pig pen] AND [presence of a brick pile]. Low risk for environmental characteristics is characterized 
by no adverse house or peridomicile conditions from the full list in Table 8.1 (Chuit et al., 2001).

If the pattern of seropositivity rates corresponds to the risk levels for the habitats, Table 8.2
should be a double-gradient table in the sense that the rates should increase in going from left to
right across each row and from top to bottom down each column. Rows 1 and 3 and column 2 exhibit 
this pattern, but there is one exceptional cell (row 2, column 3) corresponding to medium risk on 
blood availability and high risk on environmental characteristics. This requires some explanation, 
which we provide below. Column 1 appears to have a violation in row 3; however, these rates (19.3% 
and 20.5%) are not statistically significantly different from each other at level 0.05. 

The aberrant cell (row 2, column 3) is high risk on environmental characteristics. In particular,
this means that there is a poor interior and gallery roof. These conditions characterized
habitats with high risk environmental characteristics until approximately 18 months to two years
prior to serological data collection for the present study, when the owners engaged in roof repair on their
houses. The immediate effect was to eliminate localities that were previously hospitable to
T. infestans. It is, therefore, not surprising that the incidence rate for new Chagas disease cases dropped
precipitously at those sites. It is important to emphasize that not all owners of houses in the community
engaged in roof repair. Indeed, examination of Table 8.2 provoked a deeper inquiry into why
the (2, 3) cell was so anomalous. Examining the full information set for the habitats in this cell
revealed that the GoM analysis had isolated the locations in a highly heterogeneous community where
roof repair was making a major difference in Chagas disease incidence.

With the extreme habitats identified (those scoring high risk on both blood availability
and environmental characteristics, as well as those scoring low risk on both dimensions), a more
in-depth analysis was carried out to characterize the most (and least) risky habitats. To this end,
variables defining host availability and environmental conditions (shown in Table 8.1) were used to
calculate the odds ratio (OR) comparing the proportion of habitats with the highest level of risk to
the proportion having the lowest level of risk; a 95% confidence interval (CI) on the odds ratio was
also calculated. A condition was then defined to be extremely risky if the lower bound of the 95% CI
on the odds ratio exceeded 3.5. Analogously, for the habitats classified as low risk on both domains
in Table 8.2, the odds ratio comparing the lowest risk condition on each variable with the highest
risk condition on that variable and its 95% CI was calculated. Now, a condition was defined to be
extremely low risk if the lower bound of the 95% CI exceeded 3.5. Applying these stringent criteria,
a new set of low and high risk conditions was specified. They are described in Table 8.3 together
with the seropositivity rates for the subset of habitats satisfying them.
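
The screening rule can be illustrated with a short Python sketch. The text does not state how the 95% CI was computed; the sketch assumes the common Wald (Woolf) interval on the log odds ratio, and the function names and the 2 x 2 counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2 x 2 table.

    a, b -- habitats with / without the condition in the first group
    c, d -- habitats with / without the condition in the comparison group
    Assumption: the interval is computed on the log scale; the chapter does
    not say which method was used.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def is_extreme(a, b, c, d, threshold=3.5):
    """Apply the screening rule: lower 95% bound of the odds ratio exceeds 3.5."""
    return odds_ratio_ci(a, b, c, d)[1] > threshold


# Hypothetical counts: the condition is present in 40 of 50 high risk habitats
# and in 10 of 50 low risk habitats.
print(odds_ratio_ci(40, 10, 10, 40))
print(is_extreme(40, 10, 10, 40))
```

Requiring the lower confidence bound, rather than the point estimate, to exceed 3.5 is what makes the criterion stringent: a large odds ratio based on few habitats will not pass.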

TABLE 8.3

High and low risk conditions and associated seropositivity rates for children aged 5-14.

Level of risk    Conditions                                                          % seropositive
Low              [# of peridomiciliary structures = 1, but no presence of a          7.7
                 food storage area] AND [1 dog OR 1 cat]
High             [# of peridomiciliary structures > 2] AND [presence of food         36.4
                 storage area] AND [> 1 dog OR > 1 cat OR both]

Identification of habitats with these very different seropositivity rates that were, nevertheless,
embedded in an endemic community provided evidence that our mixed membership methodology,
together with the second stage screening of habitats that were low and those that were high on
risk variables from the two domains, deserves attention in other risk assessment settings. The low
risk conditions in Table 8.3 are associated with a seropositivity rate among children aged 5-14 of
7.7%, significantly lower than the rate observed in the total population (25.3%). These conditions
are, in fact, a basis for relatively simple and inexpensive restructuring of individual habitats to
substantially reduce Chagas disease risk. Further, from a methodological perspective, the standard
mixed membership model structure, which contains only a single GoM score vector, automatically
masks the risk domain distinctions achievable with the partitioned model of Equation system
(8.2).

A final methodological point pertaining to this example concerns the bivariate distribution of the
GoM scores $(g_1, g_2)$. An empirical distribution is obtained via conditional likelihood calculation of
GoM scores for each of the domains separately using the specification (8.1). There is currently no
theoretical basis for a priori imposing a class of bivariate distributions to represent the GoM scores
from two domains in the context of Chagas disease epidemiology. This situation could change with
particular applications, but thus far there is not enough experience using specification (8.2), or models
with three or more distinct domains, to warrant putting forth defensible classes of bivariate
distributions for $(g_1, g_2)$. Carrying the modeling into a Bayesian framework would require specification
of defensible prior distributions on the GoM scores. Except for a nearly uniform prior on the unit
square, we await the development of subject-matter-driven specification of more informative priors
for use with model specification (8.2).


8.4 Disability Change in the U.S. Population: 1982-1994

Populations aged 65 and older at the level of communities contain many people who have multiple 
disabilities and chronic conditions, no combination of which occurs at high frequency. This makes 
classification of elderly populations into disability/chronic conditions groups particularly
problematic. Simply describing the joint distribution of co-morbid conditions is an unwieldy and difficult
task. This setting, however, is precisely where mixed membership models can play a useful role in 
terms of representing the heterogeneity in elderly populations via interpretable sets of pure types 
and characterizations of shared membership between them among selected sub-populations. Berk- 
man et al. (1989) put forth an initial analysis in this direction, focused on the elderly community of 
New Haven, CT. 

Data used for the analysis was derived from the National Long Term Care Survey (NLTCS) 
list-based samples of approximately 20,000 persons age 65+ drawn from Medicare enrollment files 
in the years 1982, 1984, 1989, and 1994. To ensure a national sample of the age 65+ population at 
each survey date, a fresh supplementary list sample was drawn from Medicare enrollment files in
1984, 1989, and 1994. A detailed description of the NLTCS is given in Corder et al. (1993). The 
analysis examined sub-groups that contributed to an overall decline in disability between 1982 and 
1994, and some that did not follow this general trend. Mixed membership models with the structure 
of Equation system (8.1) were fit for each of the survey years to response vectors whose coordinates 
described the ability, or not, of individuals to perform a diverse set of “Activities of Daily Living” 
(ADLs), tests of physical functioning, or both. A battery of 27 ADLs, “Instrumental Activities of 
Daily Living” (IADLs), and functional impairment measures were employed for this purpose. They 
are listed in Table 8.4. 

TABLE 8.4

Activities of daily living and measures of physical functioning assessed in the
National Long Term Care Survey.

ADL items: need help with
    Eating
    Getting in/out of bed
    Dressing
    Bathing
    Using a toilet
    Getting about outside

IADL items: need help with
    Heavy work
    Light work
    Laundry
    Cooking
    Traveling
    Grocery shopping
    Managing money
    Taking medicine
    Telephoning

Are you
    Bedfast
    Using a wheelchair
    Restricted to no inside activity

Can you
    See well enough to read a newspaper

How much difficulty do you have (none, some, very difficult, cannot at all)
    Climbing 1 flight of stairs
    Bending for socks
    Holding a 10 lb. package
    Reaching over head
    Combing hair
    Washing hair
    Grasping small objects

Source: Manton et al. (1998).


For the community population, the best fitting mixed membership models for each of the survey
years, satisfying Equation system (8.1), had K = 6 pure types. There was very little variation in the 
pure types across the survey years. Independent of the model, a 7th pure type/profile was added for 
the elderly institutionalized population. This group was quite homogeneous, having an average of 
4.8 ADLs chronically impaired. The full set of pure types is described in Table 8.5. 

TABLE 8.5

Disability pure types from mixed membership model with K = 6.

I      Active, no functional impairments.
II     Very modest impairment, some difficulty climbing stairs, lifting a 10 lb.
       package, and bending for socks (no ADL or IADL).
III    Moderate physical impairment, great difficulty climbing stairs, lifting 10
       lb. package, reaching over head, etc. (no ADL or IADL).
IV     All IADLs, great difficulty climbing stairs, and lifting a 10 lb. package.
V      Some ADLs and IADLs, difficulty climbing stairs, cannot lift a 10 lb.
       package.
VI     All ADLs and all IADLs, and all tasks (high percentage in wheelchairs).
VII    Institutionalized - these are not included in the mixed membership
       model.

Source: Singer and Ryff (2001).

Although the pure types are clearly interpretable, sharing of conditions across sets of 2 and
especially 3 pure types with only mild differences among some of the characteristics presents serious
difficulties for differentiating among sub-groups. To resolve this difficulty, we introduce a
context-specific strategy for aggregating vertices and edges of the simplex to create a coarser set 
of disability categories. For this, observe that pure types I-III represent persons who are generally
functionally intact. In contrast, pure types IV-VI identify persons with significant physical or cognitive
impairments. Heterogeneity within the functionally intact group is represented by persons
who share conditions with pairs of pure types I-III. Such people have GoM score vectors at a given
survey with non-zero entries for precisely 2 pure types. For example, g = (0.3, 0.7, 0, 0, 0, 0) is
the GoM score vector for a person whose responses on ADL, IADL, and physical functioning are
closer to pure type II (a weighting of 0.7) than to pure type I (a weighting of 0.3). We will denote
the category of functionally intact persons by C(1 - 3). These are persons with response vectors at
one of the pure types I, II, or III, supplemented by persons who share conditions with any pair of
them. Geometrically, persons in C(1 - 3) are either at one of the vertices in the unit simplex labeled
I, II, or III, or they are on one of the edges that link pairs of these vertices.

Heterogeneity in the severely disabled group, labeled C(4 - 6), is represented by persons at pure
types IV, V, and VI, or by those who share conditions with any pair of them. A different form of
heterogeneity, C(int), is designated for persons who are on edges connecting one of the vertices [I,
II, III] to one of the vertices [IV, V, VI]. A more extreme form of heterogeneity, designated C(res),
is represented by persons who share conditions with 3 or more pure types. Geometrically they are
identified by points in the faces or further in the interior of the unit simplex with K = 6. The partitioning
of the population aged 65+ into the four disability categories defined above, augmented by the
institutionalized population, identifies clearly distinct groups with qualitatively different interpretations
of their mix of disabilities.
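
The aggregation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce the published results: the function name, the 0-based indexing of pure types I-VI, and the tolerance `tol` that decides when a GoM score counts as non-zero are all introduced here for the example.

```python
def disability_category(g, tol=1e-9):
    """Map a K = 6 GoM score vector to one of the coarse categories.

    Pure types I-III are indices 0-2, pure types IV-VI are indices 3-5.
    `tol` (an assumption of this sketch) decides which scores are non-zero.
    """
    if len(g) != 6:
        raise ValueError("expected a GoM score vector with K = 6 entries")
    # Pure types in which the person has non-zero membership.
    support = [k for k, g_k in enumerate(g) if g_k > tol]
    if not support:
        raise ValueError("GoM score vector has no non-zero entry")
    if len(support) >= 3:
        return "C(res)"      # shares conditions with 3 or more pure types
    if all(k <= 2 for k in support):
        return "C(1-3)"      # vertex or edge among pure types I-III
    if all(k >= 3 for k in support):
        return "C(4-6)"      # vertex or edge among pure types IV-VI
    return "C(int)"          # edge linking one of I-III to one of IV-VI


# The example vector from the text lies on the edge between pure types I and II.
example = disability_category((0.3, 0.7, 0, 0, 0, 0))
```

With estimated rather than exact GoM scores, the choice of `tol` matters, since small positive scores would otherwise push many individuals into C(res).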

Returning to the issue of disability decline mentioned at the beginning of this section, we know 
that there was a decline of 1.5% per annum in the proportion of the age 65+ population that was 
chronically disabled over the time period 1982-1994. The classification scheme introduced above 
facilitates our getting a much better picture of the variation in prevalence of chronic disability according
to our more refined classification of it. To this end, Table 8.6 shows the percent per annum
changes in prevalence of chronic disability from 1982-1994 by disability category generated from
the aggregation of vertices and edges of the unit simplex that gave rise to C(1 - 3), C(4 - 6), C(int),
and C(res). The table also includes the category Inst., which refers to institutionalized individuals.

TABLE 8.6

Percent per annum changes in prevalence of chronic disabilities from 1982-1994 by age and gender.

Disability    Men aged    Women aged    Men aged    Women aged
category      65-84       65-84         85+         85+

C(1 - 3)      +0.21       +0.30         +1.45       +0.25
C(4 - 6)      -2.78       -5.31         -2.65       -2.91
C(int)        -2.11       -0.74         +4.08       -0.16
C(res)        -1.05       -1.01         -3.13       +0.05
Inst.         -1.60       -1.71         -0.94       +0.16

Source: Singer and Ryff (2001).

Prior to the production of Table 8.6, separate GoM models were run for the years 1982, 1984,
1989, and 1994. The remarkable feature of these separate analyses was that the number of pure
types and the conditions entering into them were invariant over this 12-year period. Thus, changes 
involving an individual were captured in changes in their GoM score vectors over time. Having a 
chronic disability means having at least one ADL or IADL impairment, where such disability has lasted, or was
expected to last, at least 90 days. Prevalence of this condition in each of the disability categories is 
the basis for calculation of changes between 1982 and 1994. 

It is important to emphasize that the invariance in the number of pure types and the conditions that
enter into them, an empirical fact of life in the present analysis, is by no means generic to
GoM modeling over time. Indeed the number of pure types could have varied between, for example, 
3 and 7, depending upon the time of assessment. In addition, the conditions entering into pure types 
at each assessment time could be different. Under such a scenario, it would be impossible to discuss 
changes over time via a table as simple as Table 8.6. 

In-depth interpretation of the category-specific changes in Table 8.6 requires the use of a much 
richer set of variables from the NLTCS than used for the present methodological discussion of mixed
membership models. Extensive analysis of disability changes can be found in Manton et al. (1998), 
Manton et al. (2006), and Manton (2008). 


8.5 Discussion and Open Problems 

The major feature of mixed membership models that motivated their specification in the first place 
(Woodbury and Clive, 1974; Woodbury et al., 1978) was the empirical fact, arising in many studies, 
that crisp classification of individuals into well-defined categories was frequently difficult, if not im- 
possible. Standard clustering methods do not provide a way out of this impasse, and the observation 
that shared membership among two or more categories for individuals in a wide variety of scientific 
contexts is conceptually meaningful paved the way for elaboration of formal models to capture this 
idea (Woodbury and Clive, 1974; Woodbury et al., 1978; Davidson et al., 1989). Although mixed 
membership models can be specified according to a priori theories and used in a hypothesis testing 
mode, by far the most extensive use of the methodology has been in exploratory studies where K,
the number of pure types, and the structure of the pure types themselves are estimated from the data.
In terms of numerical goodness-of-fit criteria, best fitting mixed membership models have been ob- 
tained in many instances where K takes on values in the range 15-50 (Airoldi et al., 2010). Then 
interpretive reports are presented with a focus on the structure of the pure types themselves, with 
only minimal, if any, discussion and interpretation of shared membership across 2 or more pure
types. For a notable exception to this practice, see Airoldi et al. (2010). 

Our own attempts to consider shared membership (Berkman et al., 1989; Chuit et al., 2001; Cas- 
tro et al., 2006) rather quickly highlighted the interpretability difficulties involved in simply writing 
coherent sentences to explain shared membership involving 4 or 5 pure types. This led to the alter- 
native specification shown in Equation (8.2), which is the simplest example of partitioning response 
vectors by distinct subject-matter domains and the introduction of the assumption of conditional 
independence of variables in each domain separately given the set of GoM score vectors for all of 
them. The problem of interpreting shared conditions across multiple pure types is then transferred 
to one of providing interpretable explanations for the correlation structure in the set of GoM score 
vectors across domains. The latter situation turned out to be especially informative in our example 
of characterizing Chagas disease risk in rural Argentina. This special setting is also a generic one 
for disease risk assessment. In fact, we think it highly desirable to apply the strategy implied by 
Equation (8.2) to represent risk profiles in complex eco-epidemiological contexts quite generally. 
In particular, the process of health impact assessment (HIA) for large scale industrial projects in the 
tropics could benefit from this approach (Krieger et al., 2012; Winkler et al., 2011; 2012a;b). 

The example of classification of disability in the U.S. population, discussed in Section 8.4, is a 
prototype for interpreting shared conditions where a multiplicity of pure types can meaningfully be 
aggregated into coarse categories of roughly similar conditions. It is unclear, in terms of scientific 
subject-matter, how to characterize the problems that lend themselves to this kind of aggregation 
methodology. However, we feel it would be useful to attempt such pure type, edge, and even face
consolidations in exploratory data analyses using the mixed membership specification shown in
Equation (8.1), when K is in the range 4-6, and certainly when K > 10.

In summary, we demonstrate alternative specifications of mixed membership models where an 
increase in dimensionality of grade of membership scores is traded for simplicity in the number 
and structure of ideal types, with clear payoff in terms of interpretability of model output. Alter- 
natively, sub-sets of vertices and edges — or even faces — linking them can be aggregated to form 
interpretable categories of individuals, about which coherent descriptions can be formulated. The 
scientific subject matter must dictate which among these and other dimension-reducing strategies 
are to be employed for a particular problem. Although the statistical details of fitting mixed member- 
ship models to data lie outside the scope of the present paper, we direct the reader to the interesting 
Bayesian formulations in Airoldi et al. (2008; 2010) that focus on model specification (8.1). There 
is no analogous rigorous Bayesian methodology to date for the class of specifications exemplified
by Equation (8.2). Here is an important challenge worth taking up in the immediate future. 
We present a model in which individuals can have multiple membership into “pure types” defined
by ways of evolving over time. This modeling strategy allows us to use longitudinal data on several 
subjects to isolate and characterize a few typical trajectories over time, and to soft cluster individuals 
with respect to them. We present these methods in the context of an application to the study of 
patterns of aging in American seniors. 


9.1 Introduction 

In this chapter we introduce a Bayesian technique for soft clustering units based on similarities in
their temporal evolution using longitudinal data. Clustering based on evolution over time, or trajectories,
is of interest in many areas. For example, criminology researchers are often interested in
identifying types of “criminal careers,” and in determining how a population of offenders distributes
across them (Nagin and Land, 1993). Similarly, clinical psychologists might be interested in characterizing
the developmental course of specific disorders like depression (Dekker et al., 2007) or
post-traumatic stress disorder (Orcutt et al., 2004).

In general terms, the typical clustering problem consists of arranging a number of units, e.g., 
a sample of people, into a smaller number of classes based on the similarity of some observed 
attributes without assuming prior knowledge of the specific characteristics of the groups. From a 
model-based perspective, this is usually accomplished by fitting mixtures of the form

    p(y \mid \pi) = \sum_{k=1}^{K} \pi_k f_k(y)        (9.1)

to multivariate data about one or more individual characteristics, say y = (y_1, ..., y_J), that are expected
to inform about the relevant differences between individuals. Models of this form are usually
interpreted as partitioning the population into K disjoint sub-populations, each representing a fraction
\pi_k of the population. The sub-populations themselves are characterized by (usually parametric)
densities f_k(·), which are also to be estimated from the data.

When dealing with phenomena that evolve over time it is occasionally of interest to cluster 
individuals based on similarities in their temporal evolution. This requires longitudinal data, i.e.,
repeated observations of the same individuals at different points in time. The most direct approach 
is to assume that there exists a number of sub-populations, each characterized by a particular tra- 
jectory, and that each individual belongs to one and only one of them. Such a setup corresponds 
to the general model-based clustering approach described in the previous paragraph. In particular, 
trajectories define specific forms of f_k(·), the sub-population’s joint distribution of the sequence of
observations. 

In some applications the requirement that each individual belong exclusively to just one sub- 
population can be too restrictive to be realistic. For example, in the study of political ideology 
it is common to use terms that describe pure extreme positions, like “liberal” or “conservative.” 
However, actually assuming that every individual is either a liberal or a conservative is too broad 
a description to be useful, let alone accurate. In particular, it hides the fact that many individuals 
have opinions about different topics that correspond to more than one “pure” ideological position. 
A better alternative would be to describe individuals’ ideologies as mixtures of the pure types. 

Modeling these structures motivates the mixed membership approach. Mixed membership mod- 
els relax the assumption of exclusive cluster membership by allowing units to belong to more than 
one group simultaneously. We call this type of arrangement a soft clustering. 

In this chapter we extend the mixed membership approach to longitudinal structures based on 
trajectories. In particular, our technique allows us to construct soft-classifications based on the ways 
in which individuals evolve over time. This, in turn, allows us to isolate a few extreme or pure 
trajectories — which can be informative and easy to analyze — and to characterize units in the pop- 
ulation as individually-mixed combinations of them. We introduce this approach in the context of 
studying the individual patterns of evolution of disability in the elder American population. In the 
next section we present the applied context, which will serve as our illustrative application. Then we 
introduce the general notion of clustering based on trajectories, and present our method, the Trajec- 
tory Grade of Membership model, as a mixed membership extension of this idea. We present a fully 
Bayesian specification and an estimation algorithm based on Markov chain Monte Carlo (MCMC) 
sampling. Finally, we demonstrate the method by analyzing patterns of evolution of disability using 
data from the National Long Term Care Survey. Additional details regarding the model and analysis 
can be found in Manrique-Vallier (2013; 2010).

9.1.1 Application: The National Long Term Care Survey 

It is well known that elder Americans are living longer than in the past. Their absolute number and 
proportion are increasing rapidly (Connor, 2006). Older people often require some form of long-term
care, especially in the presence of disabilities (Manton et al., 1997). Thus, efficient allocation
of resources and overall cost prediction require information about typical patterns of disability, their 
progression over time, and their distribution over the population. 

With these issues in mind, a group of researchers and policy makers created the National Long 
Term Care Survey, NLTCS (Clark, 1998). The NLTCS is a longitudinal survey designed to evaluate 
the state and progression of chronic disability among the senior population in the United States. This
instrument tracks disability by recording each individual’s capacity to perform a set of “Activities
of Daily Living” (ADL) such as eating, bathing, or dressing, and “Instrumental Activities of Daily
Living” (IADL) such as preparing meals or maintaining finances. 

The NLTCS is comprised of six waves of interviews, administered in 1982, 1984, 1989, 1994, 
1999, and 2004. Each wave includes interviews of about 20,000 people. Whenever possible, indi- 
viduals are followed from wave to wave until death. However, due to the high mortality rate in the 
target population, each wave also includes a replacement sample of approximately 5,000 new sub- 
jects. The inclusion of these new individuals keeps each wave’s sample size approximately constant 
and representative of the population at each given time (Clark, 1998). In aggregate, approximately 
49,000 people have been interviewed between 1982 and 2004. 

We represent individual-level NLTCS data as an array, (y_{jt})_{J \times T}, of J binary items (ADLs and
IADLs) measured at T points in time (waves). Each entry of the array y_{jt} represents the presence
(y_{jt} = 1) or absence (y_{jt} = 0) of impairments to perform ADL/IADL j at the time of wave t.
Table 9.1 shows a hypothetical example of individual NLTCS data, considering only the ADLs 
(J = 6, EAT: Eating; DRS: Dressing; TLT: Toileting; BED: Getting in and out of bed; MOB: Inside 
mobility; BTH: Bathing). In this example we see that our hypothetical subject did not experience 
impediments in bathing until the 5th wave of the survey, in 1999. Similarly, by the time of the 
6th wave, he/she had limitations in performing all the ADLs from the list. The NLTCS also records
other complementary information, some of which is time-dependent (e.g., Age, in the example), and
some of which is fixed (e.g., Date of Birth and Date of Death, in the example). In this application
we will be concerned only with the binary responses to the ADL questions, and the age of each 
individual at each wave. 


                Wave (t)
                1       2       3       4       5       6

Year            1982    1984    1989    1994    1999    2004
EAT (j = 1)     0       0       0       0       1       1
DRS (j = 2)     0       1       0       0       0       1
TLT (j = 3)     0       0       0       1       1       1
BED (j = 4)     1       1       0       1       1       1
MOB (j = 5)     0       0       0       0       1       1
BTH (j = 6)     0       0       0       0       1       1
Age:            66      69      74      79      84      89
DOB:            1916
DOD:            2005

TABLE 9.1

Example of data structure for a single fictional individual. The individual itself is indexed by the
letter i ∈ {1, ..., N}.
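
As a concrete illustration of the J x T array representation, the fictional record in Table 9.1 can be written down directly. The sketch below uses plain Python lists; the variable names and the helper `first_impaired_year` are hypothetical and are not part of the NLTCS or of the chapter's software.

```python
# Rows are ADL items j = 1..6, columns are waves t = 1..6 (values from Table 9.1).
ADLS = ["EAT", "DRS", "TLT", "BED", "MOB", "BTH"]
YEARS = [1982, 1984, 1989, 1994, 1999, 2004]

Y = [
    [0, 0, 0, 0, 1, 1],  # EAT
    [0, 1, 0, 0, 0, 1],  # DRS
    [0, 0, 0, 1, 1, 1],  # TLT
    [1, 1, 0, 1, 1, 1],  # BED
    [0, 0, 0, 0, 1, 1],  # MOB
    [0, 0, 0, 0, 1, 1],  # BTH
]


def first_impaired_year(item):
    """Return the first survey year in which `item` is recorded as impaired."""
    row = Y[ADLS.index(item)]
    for year, y_jt in zip(YEARS, row):
        if y_jt == 1:
            return year
    return None  # never impaired in the observed waves


# Bathing is first recorded as impaired at the 5th wave, as stated in the text.
bath_year = first_impaired_year("BTH")
```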


We will introduce our methods by modeling the evolution of the probability of acquiring specific 
disabilities as a function of personal time (time in the system or age), using the NLTCS data. 
9.2 Longitudinal Trajectory Models

9.2.1 Clustering Based on Trajectories 

Multivariate data arising from longitudinal studies can sometimes be thought of as the expression 
of a time-continuous underlying process. For example, in the study of disability among elders, it 
is reasonable to assume that the sequence of discrete disability measurements of an individual is 
an observable expression of an underlying “aging process,” that relates age to the probability of 
experiencing a disability. We can go further and assume that this process is such that the probability 
of experiencing a functional disability will tend to increase as the person ages.

In some applications it is possible to describe the evolutionary process that underlies the ob- 
served longitudinal data using parametric functions of time or of a time-dependent covariate. We 
call these functions trajectories. An example of a trajectory is a function of age that determines the 
probability of experiencing a disability.

When the population under study is, or may be expected to be, heterogeneous, we cannot assume 
that every individual in the population follows the same underlying trajectory. In our application, 
for example, that would force us to expect that all individuals age the same way. Instead, when 
modeling these situations, we need to allow distinct individuals to respond to different trajectories. 

Adopting such a modeling scheme, where each individual is allowed to have his/her own trajec- 
tory, opens up the possibility of clustering the population based on similarities among trajectories. 
For instance we could try to cluster individuals from the NLTCS into classes defined by “types of 
aging.” We call this clustering strategy clustering based on trajectories. Besides its intrinsic ap- 
plication domain interest, clustering based on trajectories has also the advantage of allowing us to 
incorporate additional knowledge about the trajectories (e.g., their expected shape) to complement 
the information already contained in the individual sequences of responses. 

9.2.2 Hard Clustering: Group-Based Trajectory Models 

A direct approach to clustering based on trajectories is given by group-based trajectory models 
(GBTM) (Nagin, 1999; Connor, 2006). These models assume the existence of a few homogeneous 
sub-populations whose members’ responses follow the same trajectories over time. Thus, they enable
a type of hard-clustering of the population of interest.

To see how a GBTM works, let us consider modeling the progression of a single binary response. 
Let y = (y_1, ..., y_t, ..., y_T) be the sequence of binary measurements at times t = 1, 2, ..., T for the
same individual. We assume that the individual has been sampled from one of K sub-populations,
with probability \pi_k (k = 1, 2, ..., K). Then we specify the trajectory of the probability of a positive
response for a member of group k, \Phi_{\theta_k}(x), as some convenient function of a time-varying quantity
x, indexed by parameters \theta_k:

    \Phi_{\theta_k}(x) = \Pr(y_t = 1 \mid x, individual belongs to group k).

Let x = (x_1, x_2, ..., x_T) be a vector containing the T measurements of the time-dependent
quantity of interest, e.g., the age of the individual at each survey wave. Then, assuming that given
group membership and x_t, responses are all independent, we have that

    p(y \mid x, \theta) = \sum_{k=1}^{K} \pi_k \prod_{t=1}^{T} f_{\theta_k}(y_t \mid x_t),        (9.2)

where f_{\theta_k}(y_t \mid x) = \Phi_{\theta_k}(x)^{y_t} (1 - \Phi_{\theta_k}(x))^{1 - y_t}. The model in (9.2) is a discrete mixture that specifies
the distribution of a response variable within each sub-population, conditional on a time-dependent
covariate.
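
To make (9.2) concrete, the following sketch evaluates the GBTM likelihood of one binary response sequence. It is an illustration only: the logistic form chosen for the group trajectories is an assumption made here for the example (the text only requires "some convenient function"), and the function names and all parameter values are hypothetical.

```python
import math


def logistic(b0, b1, x):
    """Assumed trajectory: probability of a positive response at covariate x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def gbtm_likelihood(y, x, pi, theta):
    """Evaluate p(y | x, theta) as the mixture in (9.2).

    y     : list of 0/1 responses, one per measurement time
    x     : list of covariate values (e.g., ages), same length as y
    pi    : list of K group probabilities summing to 1
    theta : list of K (b0, b1) pairs, one per group (hypothetical parameters)
    """
    total = 0.0
    for pi_k, (b0, b1) in zip(pi, theta):
        within_group = 1.0
        for y_t, x_t in zip(y, x):
            p = logistic(b0, b1, x_t)
            # Bernoulli term f(y_t | x_t) = p^y_t * (1 - p)^(1 - y_t)
            within_group *= p if y_t == 1 else 1.0 - p
        total += pi_k * within_group
    return total


# Two hypothetical groups observed at ages 66, 69, and 74.
value = gbtm_likelihood([0, 0, 1], [66, 69, 74], [0.6, 0.4],
                        [(-8.0, 0.10), (-4.0, 0.06)])
```

Because (9.2) is a proper probability mass function, the values returned for all 2^T possible response sequences sum to one, which is a convenient sanity check.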
Connor (2006) proposed an extension for multivariate binary data consisting of J binary variables
measured at T points in time, and applied it to the analysis of the NLTCS. His proposal
extended (9.2) by assuming conditional independence between responses to different items at distinct
points in time, given covariates and group membership. Let y_{jt} be the response to item j at
measurement t (e.g., to the jth ADL of the NLTCS at wave t). Connor’s model is

    p(y \mid x, \theta) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \prod_{t=1}^{T} f_{\theta_{jk}}(y_{jt} \mid x_t).        (9.3)

This specification characterizes each sub-population based on J trajectories, each of them common 
to all of their members. 

9.2.3 Soft Clustering: The Trajectory GoM Model 

The GBTM requirement of within-cluster homogeneity can be too restrictive in
some applications. For instance, in the NLTCS case it essentially requires us to assume that every 
individual within a sub-population follows the exact same aging process. This is not plausible. 
Furthermore, one might even wonder if such sub-populations exist at all (see e.g., Kreuter and 
Muthen, 2008). 

One way of relaxing this strong assumption is by replacing the requirement of exclusive mem- 
bership with a mixed membership structure — thus constructing a soft clustering based on trajecto- 
ries. In such a case, we interpret latent groups not as sub-populations — as with group-based trajec- 
tory models — but as characterizations of extreme cases. We then model individual trajectories as 
mixtures of extreme trajectories in different individual degrees. 

The rest of this section is devoted to presenting one such model, which we will call a Trajectory 
Grade of Membership (TGoM) model. 

In longitudinal multivariate settings we are interested in studying the simultaneous progression
of a number, J, of variables as a function of time. For now, assume that response variables are binary
and that we have measurements of each variable at a number, T, of points in time. Call y_{jt} the value
of the jth variable (j = 1, ..., J) at measurement time t (t = 1, ..., T) for a particular individual. In
the NLTCS case, y_{jt} is the disability measurement j (jth ADL) at wave t.

Similar to a group-based trajectory model, we assume the existence of a small number, K, of 
ideal types of individuals or extreme profiles. However, instead of assuming that particular individuals
belong completely to those classes, we endow them with membership vectors, g = (g_1, ..., g_K)
(g_k \geq 0, \sum_k g_k = 1). Membership vectors are a characteristic of each individual. Their components,
g_k, represent the degree of membership of an individual in each of the K extreme profiles.
Ideal individuals of the kth type are individuals whose membership vector’s kth component has a
value g_k = 1, and the rest of the entries are zeros. For instance, an individual with membership
vector g = (0, 1, 0, 0) belongs exclusively to the extreme profile k = 2. An individual with membership
vector g = (0.1, 0.2, 0.7) has 10% membership in extreme profile k = 1, 20% in k = 2,
and 70% in k = 3.

We specify the trajectory of a positive response for each response variable j and extreme profile
k, \Phi_{\theta_{jk}}(x), as a function of time with parameter \theta_{jk}. These trajectories correspond to idealized progressions
of the variables of interest over time, in the same way that trajectories in the developmental
trajectory model represent the progression of variables for particular groups over time. Unlike
in GBTM, though, we do not regard individuals as being samples from the sub-populations, but as
mixtures of them.

FIGURE 9.1

Example of extreme and individual trajectories. Extreme trajectories are drawn in thick lines. Individuals
have membership vectors (0.1, 0.2, 0.7) (top thin solid curve) and (0.2, 0.7, 0.1) (bottom
thin solid curve).

Using the membership vectors, we model the trajectory of variable j for an individual with
membership vector g = (g_1, ..., g_K) as

    p(y_{jt} \mid (g_1, ..., g_K), x, \theta) = \sum_{k=1}^{K} g_k f_{\theta_{jk}}(y_{jt} \mid x).        (9.4)

As an example with K = 3 extreme profiles, consider the situation in Figure 9.1. Curves in thick
lines represent three extreme trajectories, \Phi_{\theta_{j1}}(x), \Phi_{\theta_{j2}}(x), \Phi_{\theta_{j3}}(x), for an arbitrary ADL, j. According
to (9.4), given an individual i whose membership vector is g_i, the probability of a positive
response to item j is a weighted combination of the extremes, which defines an individualized
trajectory, \Phi^{(i)}(x) = \Pr(y_{ijt} = 1 \mid g_i, x) = \sum_{k=1}^{K} g_{ik} \Phi_{\theta_{jk}}(x). Extreme trajectories (thick lines)
correspond to (most likely fictional) individuals whose membership vectors are (1, 0, 0), (0, 1, 0),
and (0, 0, 1). The two individual trajectories (thin lines) in the picture correspond to individuals
whose membership vectors are (0.1, 0.2, 0.7) and (0.2, 0.7, 0.1).
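
The individualized trajectory is simply a convex combination of the extreme trajectories, which a short sketch can make explicit. The three extreme curves below are hypothetical logistic functions with made-up parameters; they are not the curves of Figure 9.1, and the function names are introduced here only for illustration.

```python
import math


def logistic_curve(b0, b1):
    """Return a hypothetical extreme trajectory x -> probability."""
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def individual_trajectory(g, extremes):
    """Combine K extreme trajectories using membership vector g.

    g must be non-negative and sum to 1; extremes is a list of K callables.
    """
    if len(g) != len(extremes):
        raise ValueError("g and extremes must have the same length")
    if any(g_k < 0 for g_k in g) or abs(sum(g) - 1.0) > 1e-9:
        raise ValueError("g must lie in the unit simplex")
    return lambda x: sum(g_k * phi_k(x) for g_k, phi_k in zip(g, extremes))


# Hypothetical extreme trajectories for K = 3 profiles (illustrative values).
EXTREMES = [logistic_curve(-12.0, 0.15),
            logistic_curve(-8.0, 0.10),
            logistic_curve(-3.0, 0.05)]

phi_a = individual_trajectory((0.1, 0.2, 0.7), EXTREMES)
phi_b = individual_trajectory((0.2, 0.7, 0.1), EXTREMES)
```

At every age the individual curve lies between the lowest and highest extreme curve, and a vertex such as (0, 1, 0) reproduces the corresponding extreme trajectory exactly.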

In order to characterize the joint distribution of individual responses, we introduce a local independence
assumption: for a single individual, conditional on the value of the covariate of interest at
time t, x_t, and its membership vector, g, the J responses at each of the T measurement times are
mutually independent:

    p(y \mid g, x, \theta) = \prod_{j=1}^{J} \prod_{t=1}^{T} \sum_{k=1}^{K} g_k f_{\theta_{jk}}(y_{jt} \mid x_t).        (9.5)

Moving to the sample, we assume that there are N individuals. We index them using the letter
i = 1, ..., N and add a corresponding sub-index to the individual-level quantities y_i, g_i, and x_i.
Assuming that each individual has been randomly sampled from the population, we get the joint
model for the whole sample y, conditional on all the membership vectors g, and all the time-varying
covariates x:

    p(y \mid g, x, \theta) = \prod_{i=1}^{N} \prod_{j=1}^{J} \prod_{t=1}^{T} \sum_{k=1}^{K} g_{ik} f_{\theta_{jk}}(y_{ijt} \mid x_{it}).        (9.6)

Finally, assuming that the membership vectors are i.i.d. samples from a common distribution, say
F_\alpha, we get the model

    p(y \mid x, \alpha, \theta) = \prod_{i=1}^{N} \int \prod_{j=1}^{J} \prod_{t=1}^{T} \sum_{k=1}^{K} g_k f_{\theta_{jk}}(y_{ijt} \mid x_{it}) F_\alpha(dg).        (9.7)


9.2.4 Latent Class Representation of the TGoM 

Similar to the Grade of Membership model (see Erosheva et al., 2007), the model in (9.5) admits an
augmented data representation that makes it similar to the group-based multivariate developmental
trajectory model in (9.3). A few algebraic manipulations on (9.5) (Erosheva, 2002) lead to the
equivalence

    \prod_{j=1}^{J} \prod_{t=1}^{T} \sum_{k=1}^{K} g_k f_{\theta_{jk}}(y_{jt} \mid x_t) = \sum_{z \in \mathcal{Z}} \prod_{j=1}^{J} \prod_{t=1}^{T} g_{z_{jt}} f_{\theta_{j z_{jt}}}(y_{jt} \mid x_t),        (9.8)

where Z = {1, 2, ..., K} JxT is the set of all matrices (zjt) whose entries take values in 
{1, 2, ..., K }. From here it follows that, after summing over all possible realizations of z, the model 


p(y,z\x,9,g) 


nnn[ 9kff)j k ( Ujl 1 Xt ) 

j= 1 1 — 1 k— 1 


(9.9) 


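is equivalent to (9.5). For details applied to the case of the GoM model, see Erosheva et al. (2007). Considering that $g \sim F_\alpha$ and integrating (9.9), we get the unconditional distribution

$$p(y \mid x) = \sum_{z \in Z} \pi_z \prod_{j=1}^{J} \prod_{t=1}^{T} f_{\theta_{j z_{jt}}}(y_{jt}; x_t), \tag{9.10}$$

where

$$\pi_z = E_g \left[ \prod_{j=1}^{J} \prod_{t=1}^{T} \prod_{k=1}^{K} g_k^{I(z_{jt} = k)} \right]. \tag{9.11}$$
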


Equation (9.10) shows that the TGoM can be seen as a multivariate group-based DTM, just like (9.3), where the membership weights are restricted by the moments-based definition of $\pi_z$ in (9.11). We can also see that the following generative process will produce $N$ multivariate responses according to (9.10). Here we again add the individual index, $i \in \{1, \dots, N\}$, to $y_{jt}$, $g$, and $z_{jt}$.

Trajectory GoM Individual Response Generation Process

For each individual $i \in \{1, 2, \dots, N\}$
    Sample $g_i = (g_{i1}, g_{i2}, \dots, g_{iK}) \sim F_\alpha$.
    For each $j \in \{1, \dots, J\}$
        For each $t \in \{1, \dots, T\}$
            Sample $z_{ijt} \sim \text{Discrete}(g_i)$.
            Sample $y_{ijt} \sim \text{Bernoulli}(\phi_{\theta_{j z_{ijt}}}(x_{it}))$.
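
The generation process above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming a Dirichlet membership distribution and the logistic trajectory function introduced in Section 9.2.5; the function names (`logistic_trajectory`, `sample_dirichlet`, `generate_responses`) and the data layout are hypothetical and not part of the original text.

```python
import math
import random

def logistic_trajectory(beta0, beta1, x):
    """S-shaped extreme trajectory: probability of a positive response at covariate x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_responses(alpha, beta, x, rng):
    """Simulate y[i][j][t] following the generation process.

    alpha : K Dirichlet parameters (assumed form of F_alpha)
    beta  : beta[j][k] = (beta0, beta1) for item j and extreme profile k
    x     : x[i][t] = covariate value (e.g. centered age) of individual i at wave t
    """
    K = len(alpha)
    y, g = [], []
    for x_i in x:
        g_i = sample_dirichlet(alpha, rng)            # membership vector of individual i
        y_i = []
        for beta_j in beta:
            row = []
            for x_it in x_i:
                z = rng.choices(range(K), weights=g_i)[0]   # latent profile for this (j, t)
                p = logistic_trajectory(*beta_j[z], x_it)
                row.append(1 if rng.random() < p else 0)
            y_i.append(row)
        g.append(g_i)
        y.append(y_i)
    return y, g
```

Note that a fresh latent profile is drawn for every item and time point, which is what distinguishes mixed membership from a model where each individual belongs to a single class.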


9.2.5 Specifying the Trajectory Function 

In our application it is reasonable to assume that the probability of presenting a disability should increase monotonically with age. We thus follow Connor (2006) in making $\theta_{jk} = (\beta_{0jk}, \beta_{1jk})$ and using the S-shaped function

$$\phi_{\theta_{jk}}(x) = \frac{1}{1 + \exp(-\beta_{0jk} - \beta_{1jk} x)}, \tag{9.12}$$

where $x$ is a scalar representing age.

In general, the choice of trajectory functions $\phi_{\theta_{jk}}(\cdot)$ must be application-specific, as they encode assumptions about the nature of the underlying process. Thus, other applications would likely require different specifications.


9.2.6 Completing the Specification 

To complete a full Bayesian specification of the TGoM model, we need to specify the membership distribution $F_\alpha$ and prior distributions for its parameter $\alpha$ and for the trajectory parameters $\theta_{jk}$.

Following Erosheva et al. (2007), we assume $g_i \mid \alpha \sim \text{Dirichlet}(\alpha)$, where $\alpha = (\alpha_0 \xi_1, \alpha_0 \xi_2, \dots, \alpha_0 \xi_K)$ with $\alpha_0 > 0$ and $\xi_k > 0$ for all $k = 1, 2, \dots, K$, and $\sum_{k=1}^{K} \xi_k = 1$. Under this parametrization, $\xi = (\xi_1, \dots, \xi_K)$ is the expected value of the distribution. More informally, $\xi_k$ represents the relative importance of profile $k$ in the population. In turn, $\alpha_0$ is a concentration parameter: the closer $\alpha_0$ is to 0, the closer samples from $F_\alpha$ will be to the extreme profiles; conversely, the higher the value of $\alpha_0$, the closer the samples from $F_\alpha$ will tend to be to their expected value, $\xi$. Thus, for $\xi$ fixed, $\alpha_0$ controls the amount of mixed membership. We also follow Erosheva et al. (2007) in specifying independent prior distributions $\alpha_0 \sim \text{Gamma}(\tau, \eta)$ and $\xi \sim \text{Dirichlet}(\mathbf{1}_K)$.

Other specifications are possible, and in some problems they may be necessary. An important 
limitation of the Dirichlet distribution is its simple correlation structure. Regular Dirichlet distribu- 
tions do not allow the capture of complex correlations between membership in different extreme 
profiles. This might be a limitation in applications where membership in some extreme profiles has 
non-trivial relationships with membership in other profiles. A natural extension that can be useful 
in such situations is the multinomial logistic normal prior (see e.g., Blei and Lafferty, 2007). Unfor- 
tunately this specification does not share the computational advantages of the Dirichlet distribution. 
In particular, it is not conjugate to the multinomial distribution. For the parameters of the extreme trajectories specified in (9.12), $\beta$, we specify independent prior distributions $\beta_{0jk} \sim N(\mu_0, \sigma_0^2)$ and $\beta_{1jk} \sim N(\mu_1, \sigma_1^2)$.

9.3 Estimation through Markov Chain Monte Carlo Sampling

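Under the specification of extreme trajectories and priors outlined in Section 9.2, and following the augmented data representation in (9.9), the joint posterior distribution of parameters and augmented data is

$$p(\alpha, \beta, z, g \mid \text{Data}) \propto p(\beta)\, p(\alpha) \left( \prod_{i=1}^{N} p(g_i \mid \alpha) \right) \prod_{i=1}^{N} \prod_{j=1}^{J} \prod_{t=1}^{T} \prod_{k=1}^{K} \left[ g_{ik}\, \frac{\exp(y_{ijt} \beta_{0jk} + y_{ijt} \beta_{1jk} x_{it})}{1 + \exp(\beta_{0jk} + \beta_{1jk} x_{it})} \right]^{I(z_{ijt} = k)}.$$

Using the full Bayesian specification from Section 9.2.6 we have that

$$p(g_i \mid \alpha) = \text{Dirichlet}(g_i \mid \alpha_1, \alpha_2, \dots, \alpha_K),$$
$$p(\alpha_0) = \text{Gamma}(\alpha_0 \mid \tau, \eta),$$
$$p(\xi) = \text{Dirichlet}(\xi \mid \mathbf{1}_K) \quad \text{(uniform on the simplex } \Delta_{K-1}\text{)},$$

with $p(\alpha) = p(\alpha_0) \cdot p(\xi)$, where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$ and $\xi = (\xi_1, \xi_2, \dots, \xi_K)$ with $\alpha_k = \alpha_0 \cdot \xi_k$. Parameters $\tau$ and $\eta$ are shape and inverse scale parameters, respectively.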

Specifying an MCMC algorithm to obtain approximate realizations from this posterior distribu- 
tion using the Gibbs sampling algorithm is just a matter of obtaining the full conditional distributions 
of each parameter and augmented data. An implementation of this algorithm follows. 

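1. Sampling from $z$: For every $i \in \{1, \dots, N\}$, $j \in \{1, \dots, J\}$, and $t \in \{1, \dots, T\}$, sample

$$z_{ijt} \mid \dots \sim \text{Discrete}(p_1, p_2, \dots, p_K),$$

with $p_k \propto g_{ik} \exp(\beta_{0jk} + \beta_{1jk} x_{it})^{y_{ijt}} \left[ 1 + \exp(\beta_{0jk} + \beta_{1jk} x_{it}) \right]^{-1}$ for all $k \in \{1, \dots, K\}$.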
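
As an illustration of Step 1, the following sketch computes the normalized probabilities $p_1, \dots, p_K$ for a single observation and draws $z_{ijt}$ from them. It assumes the logistic trajectories of (9.12); the function names and argument layout are hypothetical. The weight $\exp(\eta)^{y} / (1 + \exp(\eta))$ is written as the Bernoulli probability of the observed response, which is the same quantity.

```python
import math
import random

def z_probabilities(g_i, beta_j, y, x):
    """Normalized full-conditional probabilities of z_ijt = k, for k = 0..K-1.

    g_i    : membership vector of individual i
    beta_j : list of (beta0, beta1) pairs, one per extreme profile, for item j
    y, x   : observed binary response and covariate value at this (i, j, t)
    """
    weights = []
    for g_ik, (b0, b1) in zip(g_i, beta_j):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))     # P(y = 1) under profile k
        weights.append(g_ik * (p if y == 1 else 1.0 - p))
    total = sum(weights)
    return [w / total for w in weights]

def sample_z(g_i, beta_j, y, x, rng):
    """Draw one latent profile indicator (coded 0..K-1) from its full conditional."""
    probs = z_probabilities(g_i, beta_j, y, x)
    return rng.choices(range(len(probs)), weights=probs)[0]
```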

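2. Sampling from $(\beta_{0jk}, \beta_{1jk})$: Let $\rho_{it} = I(z_{ijt} = k)$. Then, the full joint conditional distribution of $(\beta_{0jk}, \beta_{1jk})$ is

$$p(\beta_{0jk}, \beta_{1jk} \mid \dots) \propto \frac{\exp\left[ -\dfrac{\beta_{0jk}^2}{2\sigma_0^2} + \beta_{0jk} \left( \dfrac{\mu_0}{\sigma_0^2} + \sum_{i,t} \rho_{it} y_{ijt} \right) - \dfrac{\beta_{1jk}^2}{2\sigma_1^2} + \beta_{1jk} \left( \dfrac{\mu_1}{\sigma_1^2} + \sum_{i,t} \rho_{it} x_{it} y_{ijt} \right) \right]}{\prod_{i,t} \left[ 1 + \exp(\beta_{0jk} + \beta_{1jk} x_{it}) \right]^{\rho_{it}}},$$

where $(\mu_0, \sigma_0^2)$ and $(\mu_1, \sigma_1^2)$ are the prior means and variances of $\beta_{0jk}$ and $\beta_{1jk}$.

This distribution does not have a recognizable form. Thus we use a random walk Metropolis step: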

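(a) Proposal step: Sample the proposal values

$$\beta^*_{0jk} \sim N(\beta_{0jk}, \sigma^2_{\beta_0}) \quad \text{and} \quad \beta^*_{1jk} \sim N(\beta_{1jk}, \sigma^2_{\beta_1}),$$

where the values $\sigma^2_{\beta_0}$ and $\sigma^2_{\beta_1}$ are tuning parameters that we have to calibrate to achieve a good balance between acceptance of proposed values and exploration of the support of the target distribution.

(b) Acceptance step: Compute

$$r_M = \frac{p(\beta^*_{0jk}, \beta^*_{1jk} \mid \dots)}{p(\beta_{0jk}, \beta_{1jk} \mid \dots)} = \prod_{i,t} \left[ \frac{1 + \exp(\beta_{0jk} + \beta_{1jk} x_{it})}{1 + \exp(\beta^*_{0jk} + \beta^*_{1jk} x_{it})} \right]^{\rho_{it}}$$
$$\qquad \times \exp\left[ \frac{\beta_{0jk}^2 - \beta_{0jk}^{*2}}{2\sigma_0^2} + (\beta^*_{0jk} - \beta_{0jk}) \left( \frac{\mu_0}{\sigma_0^2} + \sum_{i,t} \rho_{it} y_{ijt} \right) \right]$$
$$\qquad \times \exp\left[ \frac{\beta_{1jk}^2 - \beta_{1jk}^{*2}}{2\sigma_1^2} + (\beta^*_{1jk} - \beta_{1jk}) \left( \frac{\mu_1}{\sigma_1^2} + \sum_{i,t} \rho_{it} y_{ijt} x_{it} \right) \right], \tag{9.13}$$

where $\rho_{it} = I(z_{ijt} = k)$, and make

$$(\beta_{0jk}, \beta_{1jk})^{(m+1)} = \begin{cases} (\beta^*_{0jk}, \beta^*_{1jk}) & \text{with probability } \min\{r_M, 1\}, \\ (\beta_{0jk}, \beta_{1jk})^{(m)} & \text{with probability } 1 - \min\{r_M, 1\}. \end{cases}$$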
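
A minimal sketch of this random walk Metropolis step follows. It works on the log scale, which is a common numerical choice and an assumption of this sketch rather than something stated in the text; the function names and the `prior` tuple `(mu0, var0, mu1, var1)` are hypothetical. The `data` argument holds only the observations currently assigned to profile $k$, that is, those with $\rho_{it} = 1$.

```python
import math
import random

def log_target(b0, b1, data, mu0, var0, mu1, var1):
    """Log full conditional of (beta0, beta1), up to an additive constant.

    data: (x_it, y_ijt) pairs for the observations with rho_it = 1.
    """
    lp = -(b0 - mu0) ** 2 / (2.0 * var0) - (b1 - mu1) ** 2 / (2.0 * var1)
    for x, y in data:
        eta = b0 + b1 * x
        # y * eta - log(1 + exp(eta)), arranged to avoid overflow for large |eta|
        lp += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def metropolis_beta(b0, b1, data, prior, step0, step1, rng):
    """One random walk Metropolis update; returns (beta0, beta1, accepted).

    step0, step1 are the proposal standard deviations (tuning parameters).
    """
    c0 = rng.gauss(b0, step0)
    c1 = rng.gauss(b1, step1)
    log_r = log_target(c0, c1, data, *prior) - log_target(b0, b1, data, *prior)
    # The proposal is symmetric, so the acceptance ratio is the target ratio.
    if rng.random() < math.exp(min(log_r, 0.0)):
        return c0, c1, True
    return b0, b1, False
```

Expanding the quadratic prior terms in `log_target` and taking the difference between proposed and current values gives the logarithm of the ratio in (9.13).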


3. Sampling from $g_i$: Since the Dirichlet distribution is conjugate to the multinomial, this expression is particularly simple:

$$g_i \mid \dots \overset{\text{indep.}}{\sim} \text{Dirichlet}(\alpha_1 + n_{i1}, \alpha_2 + n_{i2}, \dots, \alpha_K + n_{iK}),$$

where $n_{ik} = \sum_{j,t} I(z_{ijt} = k)$.
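
The conjugate update of Step 3 amounts to counting latent assignments and drawing from a Dirichlet distribution. A brief sketch, with a hypothetical function name and with profiles coded from 0 to $K - 1$ for convenience:

```python
import random

def sample_membership(alpha, z_i, rng):
    """Draw g_i from Dirichlet(alpha_k + n_ik), where n_ik counts z_ijt == k.

    z_i: flat list of the J*T latent assignments of individual i, coded 0..K-1.
    Returns the sampled vector and the posterior Dirichlet parameters.
    """
    counts = [0] * len(alpha)
    for z in z_i:
        counts[z] += 1
    post = [a + n for a, n in zip(alpha, counts)]
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    draws = [rng.gammavariate(p, 1.0) for p in post]
    total = sum(draws)
    return [d / total for d in draws], post
```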

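4. Sampling from $\alpha$: For sampling from the full conditional distribution of $\alpha$,

$$p(\alpha \mid \dots) \propto \text{Gamma}(\alpha_0 \mid \tau, \eta) \times \text{Dirichlet}(\xi \mid \mathbf{1}_K) \times \prod_{i=1}^{N} \text{Dirichlet}(g_i \mid \alpha)$$
$$\qquad \propto \alpha_0^{\tau - 1} \exp[-\alpha_0 \eta] \times \left[ \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right]^{N} \prod_{k=1}^{K} \left[ \prod_{i=1}^{N} g_{ik} \right]^{\alpha_k}, \tag{9.14}$$

we use the Metropolis-Hastings within Gibbs step proposed by Manrique-Vallier and Fienberg (2008):

(a) (Proposal step) Sample $\alpha^* = (\alpha^*_1, \alpha^*_2, \dots, \alpha^*_K)$ as independent lognormal variates from

$$\alpha^*_k \sim \text{lognormal}(\log \alpha_k, \sigma^2).$$

Again, $\sigma$ is a tuning parameter that we have to calibrate.

(b) (Acceptance step) Let $\alpha^*_0 = \sum_{k=1}^{K} \alpha^*_k$. Compute

$$r = \min\left\{ 1,\ \exp[-\eta(\alpha^*_0 - \alpha_0)] \left( \prod_{k=1}^{K} \frac{\alpha^*_k}{\alpha_k} \right) \left( \frac{\alpha^*_0}{\alpha_0} \right)^{\tau - 1} \left[ \frac{\Gamma(\alpha^*_0) \prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma(\alpha_0) \prod_{k=1}^{K} \Gamma(\alpha^*_k)} \right]^{N} \prod_{k=1}^{K} \left( \prod_{i=1}^{N} g_{ik} \right)^{\alpha^*_k - \alpha_k} \right\}$$

and update the chain, from step $m$ to step $m + 1$, according to the rule

$$\alpha^{(m+1)} = \begin{cases} \alpha^* & \text{with probability } r, \\ \alpha^{(m)} & \text{with probability } 1 - r. \end{cases}$$

To obtain samples from the posterior distribution of parameters, we simply cycle through Steps 1 to 4. Selection of the tuning parameters, $\sigma^2$, $\sigma^2_{\beta_0}$, and $\sigma^2_{\beta_1}$, can be challenging. We present an automated procedure for choosing $\sigma^2$ in the next section.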
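
Step 4 can be sketched as follows, again on the log scale. The function names are hypothetical, and `sum_log_g[k]` is assumed to hold $\sum_{i} \log g_{ik}$ computed from the current membership vectors. The term $\prod_k \alpha^*_k / \alpha_k$ in the acceptance ratio is the Hastings correction for the asymmetric lognormal proposal.

```python
import math
import random

def log_cond_alpha(alpha, sum_log_g, n, tau, eta):
    """Log of the full conditional (9.14), up to an additive constant.

    sum_log_g[k] = sum over individuals of log g_ik; n = number of individuals.
    """
    a0 = sum(alpha)
    lp = (tau - 1.0) * math.log(a0) - eta * a0
    lp += n * (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha))
    lp += sum(a * s for a, s in zip(alpha, sum_log_g))
    return lp

def mh_alpha(alpha, sum_log_g, n, tau, eta, sigma, rng):
    """One Metropolis-Hastings update of alpha; returns (alpha, accepted)."""
    prop = [math.exp(rng.gauss(math.log(a), sigma)) for a in alpha]
    log_r = (log_cond_alpha(prop, sum_log_g, n, tau, eta)
             - log_cond_alpha(alpha, sum_log_g, n, tau, eta))
    # Hastings correction: q(alpha | alpha*) / q(alpha* | alpha) = prod_k alpha*_k / alpha_k
    log_r += sum(math.log(p) - math.log(a) for p, a in zip(prop, alpha))
    if rng.random() < math.exp(min(log_r, 0.0)):
        return prop, True
    return alpha, False
```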

9.3.1 Tuning the Population Proposal Distribution 

The MCMC algorithm described in Section 9.3 requires that we select two sets of tuning parameters in order to balance good acceptance rates with an adequate exploration of the support of the posterior distribution. The effect of the tuning parameters for sampling $\beta$, $\sigma_{\beta_0}$ and $\sigma_{\beta_1}$, is fairly independent of sample size, and is also stable across models with different numbers of extreme profiles. Thus we tune them by trial and error. The situation is different for $\sigma$, the tuning parameter for sampling $\alpha$. Acceptance rates for $\alpha$ are very sensitive to small changes in $\sigma$. Additionally, the same values of $\sigma$ produce wildly different acceptance rates in models with different numbers of extreme profiles, and when dealing with different sample sizes.

In order to reduce the costly guesswork associated with choosing $\sigma$, we propose an automated procedure. With a pre-specified acceptance rate in mind, acc, we try different values of $\sigma$, and record whether the proposals for $\alpha$ are accepted or not. Then we gather these results and use logistic regression to pick a value of $\sigma$ likely to achieve the target acceptance rate. Finally, we discard all the generated samples and run the chain, keeping $\sigma$ fixed at the found value. We note that even though logistic regression assumptions are not satisfied in this case (in particular, observations are clearly not independent given predictors), we have empirically found this procedure to deliver excellent results.

In practice we use a two-phase search strategy. In the first phase we find an interval $[W_1, W_2]$ of values of $\sigma$ that make the acceptance rate fall within a target interval $[\text{acc}_1, \text{acc}_2]$. The following algorithm implements this first step. It requires us to provide a reasonably wide starting interval for $\sigma$, $[W_1^*, W_2^*]$, and a number of steps, $FS_1$.


First Pass: Reduce the interval $[W_1, W_2]$ so that $\Pr(\text{acceptance}) \in [\text{acc}_1, \text{acc}_2]$.
Initialization: Let $\Delta = (\log(W_2^*) - \log(W_1^*)) / FS_1$.

For $n = 1, \dots, FS_1$
    Update chain using $\log \sigma = \log(W_1^*) + \Delta n$.
    If $\alpha^*$ accepted, let $a_n = 1$. Otherwise, let $a_n = 0$.

Fit a logistic regression model, $\text{logit}(a_n) = a + \beta \Delta n$. Get estimates $\hat{a}$ and $\hat{\beta}$.

Let $W_2 = \exp((\text{logit}\, \text{acc}_1 - \hat{a}) / \hat{\beta})$ and $W_1 = \exp((\text{logit}\, \text{acc}_2 - \hat{a}) / \hat{\beta})$.

We have found that starting values $W_1^* = 0.001$ and $W_2^* = 0.1$ work for most problems. Also, for a target of acc = 30%, a good first-pass target interval is $\text{acc}_1 = 0.2$ and $\text{acc}_2 = 0.8$.

In the second phase, we search within the reduced interval $[W_1, W_2]$ for a single value of $\sigma$ likely to attain the target acceptance rate, acc. The following algorithm also requires us to set up a number of iterations ($FS_2$) in advance. We have found that good choices for the number of iterations are $FS_1 = 700$ for the first phase and $FS_2 = 300$ for the second.

Second Pass: For $\sigma \in [W_1, W_2]$, find $\sigma$ such that $\Pr(\text{acceptance}) = \text{acc}$.

Initialization: Let $\Delta = (W_2 - W_1) / FS_2$.

For $n = 1, \dots, FS_2$
    Update chain using $\sigma = W_1 + \Delta n$.
    If $\alpha^*$ accepted, let $a_n = 1$. Otherwise, let $a_n = 0$.

Fit a logistic regression model, $\text{logit}(a_n) = a + \beta \log(\Delta n)$. Get estimates $\hat{a}$ and $\hat{\beta}$.

Let $\sigma = \exp((\log(\text{acc}/(1.0 - \text{acc})) - \hat{a}) / \hat{\beta})$.

We reiterate the importance of discarding the values sampled during this calibration phase. The 
calibration operation uses all the acceptance outcomes generated during the adaptation phase to 
modify the kernel of the process. Thus, it renders the whole phase non-Markovian. 
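
The two-pass calibration can be sketched as below. This is a hedged illustration, not the authors' implementation: `try_sigma` is a hypothetical callback that performs one update of the chain with proposal scale `sigma` and reports whether the proposal for $\alpha$ was accepted, and the logistic regression is fitted with a small hand-written Newton routine. One deliberate assumption of the sketch is that both regressions use $\log \sigma$ itself as the predictor, so that inverting the fitted line returns a value of $\sigma$ directly.

```python
import math
import random

def fit_logistic(xs, ys, iters=100):
    """Fit logit P(y = 1) = a + b * x by Newton's method; returns (a, b)."""
    xbar = sum(xs) / len(xs)                   # center x for numerical stability
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            u = x - xbar
            eta = max(-30.0, min(30.0, a + b * u))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * u
            h00 += w
            h01 += w * u
            h11 += w * u * u
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a - b * xbar, b                     # undo the centering

def logit(p):
    return math.log(p / (1.0 - p))

def calibrate_sigma(try_sigma, w1_start, w2_start, fs1, fs2, acc, acc1, acc2):
    """Two-pass search for a proposal scale with acceptance rate near acc."""
    # First pass: log-spaced grid over the wide starting interval.
    step = (math.log(w2_start) - math.log(w1_start)) / fs1
    xs = [math.log(w1_start) + step * n for n in range(1, fs1 + 1)]
    ys = [1 if try_sigma(math.exp(x)) else 0 for x in xs]
    a, b = fit_logistic(xs, ys)
    w2 = math.exp((logit(acc1) - a) / b)
    w1 = math.exp((logit(acc2) - a) / b)
    # Second pass: evenly spaced grid inside the reduced interval.
    step = (w2 - w1) / fs2
    sigmas = [w1 + step * n for n in range(1, fs2 + 1)]
    ys = [1 if try_sigma(s) else 0 for s in sigmas]
    a, b = fit_logistic([math.log(s) for s in sigmas], ys)
    return math.exp((logit(acc) - a) / b)
```

As the text stresses, every draw produced while `calibrate_sigma` runs should be discarded, since the proposal scale changes from one iteration to the next.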


9.4 Using the TGoM 

Now we return to our illustrative application. Our goal is to identify profiles of typical trajectories 
of progression into disability and to determine the structure of membership of the population into 
those profiles. For illustration purposes we have fit a TGoM model with K = 3 extreme profiles. 

We used a sub-sample of the NLTCS that included $N = 39{,}323$ individuals measured on $T = 6$ waves. The response vector included the six ($J = 6$) binary coded ADLs shown in Table 9.1.

We chose the prior distribution for $\alpha = \alpha_0 \cdot \xi$ as independent $\alpha_0 \sim \text{Gamma}(1, 5)$ and $\xi \sim \text{Dirichlet}(\mathbf{1}_K)$. This prior specification expresses the notion of complete ignorance about the relative importance of the extreme profiles in the population and preference for smaller values of the concentration parameter, $\alpha_0$. The reasons behind the last choice are mostly interpretative: a Dirichlet distribution with small values of $\alpha_0$ will produce individual realizations that are closer to one particular vertex of the simplex, with some influence from the other vertices; and as $\alpha_0$ goes all the way down to 0, it approaches a degenerate discrete distribution over the vertices. This arrangement allows us to talk about “dominant profiles” that are influenced by the others, easing the interpretation of the results while still allowing the mixed membership apparatus to handle a significant degree of heterogeneity. For the extreme trajectory parameters, $\beta$, we have chosen the relatively diffuse priors $\beta_{0jk}, \beta_{1jk} \sim N(0, 100)$.

We tuned the proposal distribution for $\alpha$ using the two-step algorithm described in Section 9.3.1. The resulting tuning parameter was $\sigma_\alpha = 0.011$. We set the remaining tuning parameters as $\sigma_{\beta_0} = 0.2$ and $\sigma_{\beta_1} = 0.02$.

The chain converges quickly, after around 15,000 iterations, but exhibits rather high autocorrelation; for this reason, we had to perform long runs of 100,000 iterations. We then discarded the first 20,000 iterations and sub-sampled the remainder, retaining one sample out of every 5 and discarding the rest. Figure 9.2 presents the trace plot of the parameter $\alpha_0 = \sum_k \alpha_k$.

TGoM models include two sets of directly interpretable parameters. The first group, $\alpha_0$ and $\xi$, characterizes the common distribution of the individual mixed membership scores, $F_\alpha$. Table 9.2 presents estimates (posterior means and standard deviations) of these parameters. From these summaries we see that the posterior distribution of $\alpha_0$ is tightly concentrated around $\hat{\alpha}_0 = 0.261$. This value is small, but it still leaves room for a significant degree of mixed membership. In particular, the posterior point estimates of $\alpha_0$ and $\xi$ imply that around 40% of the individuals have responses that are influenced by more than one extreme profile.

FIGURE 9.2
Trace plot of parameter $\alpha_0$.

    $\alpha_0$      $\xi_1$        $\xi_2$        $\xi_3$
    0.261      0.645      0.251      0.104
    (0.006)    (0.004)    (0.004)    (0.002)

TABLE 9.2
Posterior estimates of population-level parameters for model with $K = 3$. Numbers in parentheses are posterior standard deviations.

The extreme trajectory profiles, characterized by the $\beta$ parameters, describe typical progressions into disability as people get older. Figure 9.3 shows such trajectories for each ADL. The first extreme profile exhibits aging progressions where people remain basically healthy for most of their lives. As we consider the other extreme profiles ($k = 2$ and $k = 3$), we observe what we can describe as a decreasing gradation in the age of onset of disability: around 85 for profile $k = 2$ and around 70 for profile $k = 3$. This last profile describes a very early onset of disability, followed by a long decline. We note that extreme profiles are sorted according to their relative importance in the population (parameter $\xi_k$). This means that healthy aging trajectories are the most common in the population and that early onset of disability is not so prevalent.

To aid interpretation of the extreme profiles, we consider the quantity

$$\text{Age}_{0.5,jk} = -\frac{\beta_{0jk}}{\beta_{1jk}} + C, \tag{9.15}$$

which expresses the age at which an ideal individual of the extreme profile $k$ reaches a 0.5 probability of being unable to perform ADL $j$. We take these numbers as indicative of the age of onset of disability in ADL $j$ for extreme profile $k$. We add the constant $C = 80$ because, before fitting the model, we re-centered the original age data by subtracting 80 years, as a matter of computational convenience.
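
The quantity in (9.15) follows from setting the logistic trajectory (9.12) equal to 0.5, which happens exactly when $\beta_{0jk} + \beta_{1jk} x = 0$. A small worked sketch, with hypothetical function names and made-up parameter values chosen only for illustration:

```python
import math

def trajectory(beta0, beta1, centered_age):
    """Logistic extreme trajectory evaluated at a centered age (age minus 80)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * centered_age)))

def age_at_half(beta0, beta1, center=80.0):
    """Age at which the extreme trajectory reaches probability 0.5."""
    return center - beta0 / beta1

# Illustrative values only: beta0 = -3.0, beta1 = 0.2 give an onset age near 95.
onset = age_at_half(-3.0, 0.2)
```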

Table 9.3 shows posterior estimates of $\text{Age}_{0.5,jk}$ for our fitted model. We have sorted the ADLs according to the estimates of $\text{Age}_{0.5,jk}$ to give an idea of the sequence in which people start experiencing limitations. We note that the resulting sequence of ADLs remains the same on each extreme profile. Looking closer at the resulting sequence [inside mobility ($j = 5$) → toileting ($j = 3$) → dressing ($j = 2$) → bathing ($j = 6$) → getting in and out of bed ($j = 4$) → eating ($j = 1$)], we note that it corresponds to what we intuitively expect: the most severe disabilities are the latest to manifest. We also note that, due to the way we have specified individual trajectories in the model formulation, this sequence remains the same for the weighted individual trajectories.

FIGURE 9.3
Extreme trajectories of disability over time for each ADL and extreme profile ($K = 3$). Vertical discontinuous lines mark parameter $\text{Age}_{0.5,jk}$.


ADL (j)      Extreme Profile k (sd)
             k = 1               k = 2              k = 3
5 (BTH)      95.107 (0.155)      80.539 (0.093)     65.940 (0.164)
3 (MOB)      95.332 (0.139)      81.457 (0.084)     66.399 (0.155)
2 (BED)      97.824 (0.174)      83.156 (0.091)     67.674 (0.179)
6 (TLT)      99.538 (0.231)      83.731 (0.097)     69.315 (0.181)
4 (DRS)      100.210 (0.235)     84.873 (0.105)     69.959 (0.197)
1 (EAT)      104.768 (0.462)     88.933 (0.172)     80.725 (0.477)

TABLE 9.3
Posterior estimates of age of onset of disability (posterior means of parameter $\text{Age}_{0.5,jk}$) for model with $K = 3$ extreme profiles. Numbers in parentheses are posterior standard deviations. ADLs are sorted increasingly according to estimates of $\text{Age}_{0.5,jk}$. Note that the sorted sequence of ADLs remains the same for every extreme profile.

9.5 Discussion and Extensions

Mixed membership models are powerful tools in situations in which we believe that a few proto- 
typical or extreme cases can be isolated and analyzed, but we do not necessarily believe that units 
conform exactly to those cases. 

In this chapter we have introduced a family of mixed membership models for longitudinal data, the Trajectory Grade of Membership models. These models characterize extreme profiles using functions that, with the help of time-dependent covariates, express the evolution of responses over time. Individuals have mixed membership in the extreme profiles, meaning that their evolution over time is described not by a single extreme profile but by a combination of the extremes, weighted by their individual membership. Through joint estimation of all the model’s parameters from data, these methods allow us to infer the extreme profiles’ characteristics (trajectories), the individual membership structure of units from the sample, and the distribution of the population with respect to the extreme profiles.

Our application to the study of disability and aging using data from the National Long Term 
Care Survey illustrates how TGoM models work. In this application, the extreme trajectories are 
simplified representations of prototypical ways of aging, expressed as the probability of becoming 
disabled as a function of age. The mixed membership structure represents the individual hetero- 
geneity, by allowing individuals to follow individualized aging trajectories, described by weighted 
combinations of the extremes. 

TGoM models conform to the general characterization of mixed membership models described in Erosheva et al. (2004) and Erosheva and Fienberg (2005). As such, they admit a number of natural extensions. First, we can expand the characterization of extreme profiles to include any other responses that might be reasonable to model jointly. This may include discrete or continuous variables as well as other trajectories. For instance, analyzing the NLTCS, Manrique-Vallier (2010) modeled extreme profiles through the use of trajectories together with survival distributions. This way, extreme profiles summarized not only typical ways of aging, but also typical survival patterns.

Another natural extension can be obtained by specifying the population-level distribution of individual membership vectors, $F_\alpha$, conditional on individual-level covariates. Manrique-Vallier (2010; 2013) used this strategy to introduce cohort effects, noting that as one considers younger cohorts, the distribution of individual membership vectors tends to be more concentrated towards extreme profiles characterized by healthy aging trajectories, to the detriment of other patterns. This allowed the detection of a steady improvement in the quality of aging for younger cohorts.

Alzheimer’s disease is the most frequent form of dementia in the elderly, and age is its most powerful risk factor. One idea is to model the probability of being diagnosed with dementia at different ages in order to construct trajectories for different categories of people. Mixed membership models constitute the most promising method for this problem. We develop a few ideas of Manrique-Vallier (2010) to extend the basic TGoM model. In particular, we propose a parametric dependence between the distribution of the membership vectors and a few time-invariant covariates that allows us to interpret their effect on the individual trajectories.


10.1 Introduction 

The previous chapter (Manrique-Vallier, 2013) introduced a family of mixed membership models, the Trajectory Grade of Membership models (TGoM), useful in analyzing longitudinal data, i.e., sequences of responses obtained from the same individuals at various points in time. Each of the $N$ individuals in the analysis is represented by a trajectory, which describes the evolution of the probability of particular values of the response variables over time. The individual trajectories are modeled as weighted combinations of a small number $K$ of typical trajectories, corresponding to $K$ ideal types of individuals or extreme profiles. Considering only one response variable $Y$, an individual trajectory at time $t$ can be written as

$$P(y_t \mid (g_1, \dots, g_K), x_t, \theta) = \sum_{k=1}^{K} g_k\, f_{\theta_k}(y_t \mid x_t).$$

The membership vector $g = (g_1, \dots, g_K)$ describes the degree of closeness of an individual to each extreme profile; $x_t$ is the value of a time-dependent covariate (e.g., age), and $f_{\theta_k}(y_t \mid x_t)$ is a function of time with parameter $\theta_k$ describing the trajectory of an individual of extreme profile $k$.

In this chapter we develop a few ideas of Manrique-Vallier (2010) to extend the basic TGoM 
in two directions. First, we include the survival outcomes as a response: the presence of dementia 
is correlated with mortality in elderly years (Bowen et al., 1996; Brodaty et al., 2012; Molsa et al., 
1995) and information about survival times can help to explain disability patterns. Second, we add 
time-invariant covariates in the model. We propose a particular parametric dependence between the membership distribution $G_\alpha$ and time-invariant covariates that allows us to interpret their effect on the membership vector in a way that is similar to the effect of covariates in a simple logistic regression.

10.1.1 Application: The Cardiovascular Health Study — Cognition Study 

Alzheimer’s disease (AD) is the most common cause of dementia in the elderly, and age is the most 
important risk factor for the development of clinical dementia. The prevalence of AD increases ex- 
ponentially between the ages of 65 and 85, approaching 50% in the oldest segment of the population 
(Evans et al., 1989; Fitzpatrick et al., 2004). After 90 years of age, the incidence of AD increases 
dramatically, from 12.7%/year in the 90-94 age group, to 21.2%/year in the 95-99 age group, and 
to 40.7%/year in those over 100 years old (Corrada et al., 2010). This risk of AD is further affected 
by the presence of the APOE*4 allele, male sex, lower education, and having a family history of 
dementia (Fitzpatrick et al., 2004; Launer et al., 1999; Tang et al., 1996). Medical risks include the 
presence of systemic hypertension, diabetes mellitus, cardiovascular disease, and cerebrovascular 
disease (Irie et al., 2005; Kuller et al., 2003; Luchsinger et al., 2001; Matsuzaki et al., 2010; Ohara 
et al., 2011; Skoog et al., 1996). Lifestyle factors affecting risk include physical and cognitive activity and diet (Erickson et al., 2010; Scarmeas et al., 2006; Verghese et al., 2003). It is the interactions among these risk factors and the pathobiological cascade of AD that determine the likelihood of a clinical expression of AD, either as dementia or Mild Cognitive Impairment (MCI) (Lopez et al., 2012).

The Cardiovascular Health Study–Cognition Study (CHS-CS) is a rich database of multiple metabolic, cardiovascular, cerebrovascular, and neuroimaging variables obtained over the past 20 years, as well as detailed cognitive assessments beginning in 1990-91 (Saxton et al., 2004), 1998-99 (Lopez et al., 2003), 2002-03 (Lopez et al., 2007), and annually thereafter.

In 1992-94, 924 of the CHS participants in Pittsburgh underwent a structural MRI scan of the brain, and these individuals constitute the initial cohort of the Pittsburgh CHS-CS (Kuller et al., 2003). In our analysis we use data from the 652 individuals who were alive in 1998 and who agreed to genetic testing for APOE*4. We consider a single response variable $Y$ that codes diagnosis for each individual at different ages:

$$Y = \begin{cases} 1 & \text{if dementia} \\ 2 & \text{if MCI} \\ 3 & \text{if normal.} \end{cases}$$

Age is the time-dependent variable that defines the trajectories. In other words, we are interested in the probability of being diagnosed with MCI or dementia at different ages. We will also consider four time-invariant binary predictors: $X_1$ = Race (White), $X_2$ = Education (Beyond High School), $X_3$ = Hypertension (Present), and $X_4$ = APOE*4 (Present).

There are a variety of pathways or trajectories that individuals can take as part of the natural history of AD. In order to try to capture these different pathways, we adapt the work of Manrique-Vallier (2010) on modeling trajectories toward disability (Manrique-Vallier and Fienberg, 2009) that combines features of a version of the cross-sectional Grade of Membership model (Erosheva et al., 2007) with those of a longitudinal multivariate latent trajectory model (Connor, 2006). This technique allows our data to identify a small number of theoretically appealing ‘canonical’ trajectories to dementia or MCI and then express each individual’s trajectory as a weighted combination of these canonical trajectories.


10.2 The Extended TGoM Model 

In this section we present two extensions of the Trajectory Grade of Membership model. We start 
by recalling the basics of mixed membership models then gradually include survival outcomes and 
time-invariant predictors in our analysis. 

Mixed membership models assume the existence of a small number of “typical classes” of individuals and model their evolution over time. They regard individuals as belonging to all of these classes to different degrees by considering them as weighted combinations of the typical classes. This makes it possible to describe distinct general tendencies (the typical cases) while accounting for individual variability.

Following the strategy described in the previous chapter, we start by assuming the existence of a specific number, $K$, of “typical classes” or “typical profiles” and we associate each individual, $i$, for $i \in \{1, \dots, I\}$ (in our application $I = 652$), with its own membership vector $g_i = (g_{i1}, \dots, g_{iK})$, representing the different degrees of closeness to each typical profile. Membership scores are restricted so that $g_{ik} \ge 0$ and $\sum_{k=1}^{K} g_{ik} = 1$ for any $i$. An individual with membership vector $g_i = (0, \dots, 0, 1, 0, \dots, 0)$, where 1 is in the $k$th position, is called an “ideal” (or extreme) individual of class $k$.

For any individual that is an ideal member of the $k$th typical class, we specify the distribution of the outcome variable $Y$, to form a trajectory for the response variable. Therefore,

$$f_{\theta_k}(y_i \mid \text{Age}_i) = P(Y_i = y_i \mid \text{Age}_i,\ i\text{th individual in } k\text{th class})$$

indicates the probability of outcome $y_i$ for an ideal individual of the $k$th class at a particular age.

We introduce the idea of mixed membership by setting the distribution of the outcome variable $Y_i$ for each individual $i$ as the convex combination

$$P(Y_i = y_i \mid (g_{i1}, \dots, g_{iK}), \text{Age}_i) = \sum_{k=1}^{K} g_{ik}\, f_{\theta_k}(y_i \mid \text{Age}_i). \tag{10.1}$$

Then we assume that for a single individual, conditional on the age at time $t$, $\text{Age}_{it}$, and its membership vector, the responses at $T$ measurement times are independent of each other:

$$P(Y_i = y_i \mid (g_{i1}, \dots, g_{iK}), (\text{Age}_{i1}, \dots, \text{Age}_{iT})) = \prod_{t=1}^{T} \sum_{k=1}^{K} g_{ik}\, f_{\theta_k}(y_{it} \mid \text{Age}_{it}).$$

We further assume that the individuals are randomly sampled from the population and that the membership vectors are i.i.d. sampled from a common distribution $G_\alpha$, with support $\Delta_{K-1}$, to obtain the unconditional expression

$$P(Y = y \mid \text{Age}) = \prod_{i=1}^{N} \int_{\Delta_{K-1}} \prod_{t=1}^{T} \sum_{k=1}^{K} g_k\, f_{\theta_k}(y_{it} \mid \text{Age}_{it})\, G_\alpha(dg).$$

10.2.1 Specifying the Trajectory Function

We must also specify a model for $f_{\theta_k}(y \mid \text{Age})$. Since in our application the outcome variable diagnosis has three ordered outcomes, we consider an ordered multinomial logit model (Gelman, 2007), described by the two following logistic regressions, for $k = 1, \dots, K$:

$$P(Y > 1 \mid \text{Age},\ \text{individual in } k\text{th class}) = \text{logit}^{-1}(\beta_{0k} + \beta_{1k} \text{Age}),$$
$$P(Y > 2 \mid \text{Age},\ \text{individual in } k\text{th class}) = \text{logit}^{-1}(\beta_{0k} + \beta_{1k} \text{Age} - c_k). \tag{10.2}$$

We then compute the probabilities of individual outcomes using the formulas:

$$P(Y = 1) = 1 - P(Y > 1),$$
$$P(Y = 2) = P(Y > 1) - P(Y > 2),$$
$$P(Y = 3) = P(Y > 2). \tag{10.3}$$

Therefore, the expression in (10.1) implicitly contains $\theta_k = (\beta_{0k}, \beta_{1k}, c_k)$, for $k = 1, \dots, K$. The parameters $c_k$, which are called thresholds or cutpoints, are constrained to be positive, because the probabilities in (10.2) must be strictly decreasing, i.e., $P(Y > 1) > P(Y > 2)$.
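
To make (10.1) through (10.3) concrete, the sketch below computes the three outcome probabilities for an ideal individual and then mixes them with a membership vector. The function names and the numerical values in the usage line are hypothetical and serve only as an illustration.

```python
import math

def inv_logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def outcome_probs(age, beta0, beta1, c):
    """(P(Y=1), P(Y=2), P(Y=3)) for an ideal individual, following (10.2)-(10.3)."""
    gt1 = inv_logit(beta0 + beta1 * age)        # P(Y > 1)
    gt2 = inv_logit(beta0 + beta1 * age - c)    # P(Y > 2); c > 0 keeps gt2 below gt1
    return (1.0 - gt1, gt1 - gt2, gt2)

def individual_probs(age, g, theta):
    """Mixed membership combination (10.1): sum_k g_k f_{theta_k}(y | age).

    g     : membership vector of the individual
    theta : list of (beta0, beta1, c) triples, one per typical class
    """
    mix = [0.0, 0.0, 0.0]
    for g_k, (b0, b1, c) in zip(g, theta):
        for y, p in enumerate(outcome_probs(age, b0, b1, c)):
            mix[y] += g_k * p
    return mix

# Made-up parameters for two classes, evaluated at age 70.
probs_at_70 = individual_probs(70.0, [0.2, 0.8], [(5.0, -0.05, 1.0), (8.0, -0.1, 2.0)])
```

Because each class-specific vector sums to one and the membership scores sum to one, the mixed probabilities also sum to one.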

10.2.2 First Extension: Specifying the Dependency of Membership Vectors on 
Additional Covariates 

Instead of attributing all variation over time to aging, we could place additional predictors in two 
different parts of the model. The first alternative is to place them at the level of the extreme profiles, 
as we have done for the variable Age. The second alternative is to model a dependency between 
the membership vectors and the new predictors. This is the strategy that we use, since it does not 
change the interpretation of the extreme profiles given in the previous chapter. 

Suppose that, for each of the $N$ individuals in our analysis, we have information about $M$ binary time-invariant predictors $X_1, \dots, X_M$. We evaluate the effect of these predictors on the proximity of individuals to the three trajectories by allowing the distribution of the membership vectors $g_i = (g_{i1}, g_{i2}, g_{i3})$ to depend on the predictors:

$$g_i \mid \alpha(x_i) \sim \text{Dirichlet}(\alpha(x_i)) \quad \text{for } i = 1, \dots, I, \tag{10.4}$$

where

$$\alpha(x) = \big( \exp(a_{01} + a_{11} x_1 + \dots + a_{M1} x_M),\ \exp(a_{02} + a_{12} x_1 + \dots + a_{M2} x_M),\ \exp(a_{03} + a_{13} x_1 + \dots + a_{M3} x_M) \big). \tag{10.5}$$

Then by (10.5) and the properties of the Dirichlet distribution, we can see that

$$\log \frac{E(g_{i1} \mid \alpha, x)}{E(g_{i2} \mid \alpha, x)} = (\alpha_{01} - \alpha_{02}) + (\alpha_{11} - \alpha_{12}) x_1 + \cdots + (\alpha_{M1} - \alpha_{M2}) x_M,$$

$$\log \frac{E(g_{i1} \mid \alpha, x)}{E(g_{i3} \mid \alpha, x)} = (\alpha_{01} - \alpha_{03}) + (\alpha_{11} - \alpha_{13}) x_1 + \cdots + (\alpha_{M1} - \alpha_{M3}) x_M,$$

$$\log \frac{E(g_{i2} \mid \alpha, x)}{E(g_{i3} \mid \alpha, x)} = (\alpha_{02} - \alpha_{03}) + (\alpha_{12} - \alpha_{13}) x_1 + \cdots + (\alpha_{M2} - \alpha_{M3}) x_M,$$

so that we can interpret the difference $(\alpha_{mk} - \alpha_{mh})$ as the effect of variable $X_m$ on the population
log odds of the event “individual $i$ has a trajectory near profile $k$” versus the event “individual $i$
has a trajectory near profile $h$.” Other specifications for the dependency of the distribution of the
membership vectors on the time-invariant predictors are possible. See Manrique-Vallier (2010) for
another parametric function of the covariates $\alpha(X)$, and Blei and Lafferty (2007) and Galyardt (2012)
for a logistic-normal prior that replaces the Dirichlet in (10.4).
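
The link between (10.5) and the log-odds expressions can be checked numerically. The following is a minimal sketch, assuming NumPy; the coefficient matrix `alpha`, the covariate vector `x`, and the function names are hypothetical illustrations, not estimates from the CHS data. It uses the fact that a Dirichlet vector with parameter $\alpha(x)$ has mean $\alpha(x)$ divided by its sum.

```python
import numpy as np

def dirichlet_params(alpha, x):
    """Dirichlet parameter vector alpha(x) as in (10.5).

    alpha: array of shape (M + 1, K); row 0 holds the intercepts alpha_{0k}.
    x:     binary covariate vector of length M.
    """
    design = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return np.exp(design @ alpha)

def expected_membership(alpha, x):
    """Mean of a Dirichlet(alpha(x)) vector: alpha(x) normalised to sum to one."""
    a = dirichlet_params(alpha, x)
    return a / a.sum()

# Hypothetical coefficients for M = 2 predictors and K = 3 profiles.
alpha = np.array([[0.5, 0.2, -0.1],
                  [1.0, -0.2, 0.3],
                  [0.4, 0.1, 0.0]])
x = np.array([1, 0])

mean_g = expected_membership(alpha, x)
# Log odds of profile 1 versus profile 2, computed from the expected memberships.
log_odds_12 = np.log(mean_g[0] / mean_g[1])
# The same quantity from the linear expression in coefficient differences.
linear_12 = (alpha[0, 0] - alpha[0, 1]) + (alpha[1:, 0] - alpha[1:, 1]) @ x
```

The two quantities `log_odds_12` and `linear_12` agree because the normalising constant of the Dirichlet mean cancels in the ratio.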

10.2.3 Second Extension: Modeling Mortality 

The presence of dementia is correlated with mortality. Patients with dementia are more likely to 
die than individuals of the same age without dementia (Bowen et al., 1996; Brodaty et al., 2012; 
Molsa et al., 1995). Information about survival times can help to reconstruct certain regions of some 
trajectory patterns for which information about diagnoses is not sufficient. By design, all subjects in 
the CHS-CS are older than 65 years, therefore any reference to the distribution of survival time refers 
to the conditional version, given that the subjects have already lived more than 65 years. Within each 
canonical profile, we model the random survival time variable (s) in excess of 65 years using the 
Weibull distribution with inverse scale parameter $\lambda_k$ and shape parameter $\delta_k$, for $k = 1, \dots, K$:

$$w(s; \lambda_k, \delta_k) = \delta_k \lambda_k^{\delta_k} s^{\delta_k - 1} e^{-(s \lambda_k)^{\delta_k}}.$$
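
As an illustration of this parameterization, the sketch below evaluates the density, the corresponding survival function $\exp\{-(s\lambda_k)^{\delta_k}\}$, and the median survival time in excess of 65 years. It assumes NumPy; the numerical values of `lam` and `delta` are hypothetical and chosen only to make the example concrete.

```python
import numpy as np

def weibull_pdf(s, lam, delta):
    """Weibull density with inverse scale lam and shape delta."""
    s = np.asarray(s, dtype=float)
    return delta * lam**delta * s**(delta - 1.0) * np.exp(-(s * lam)**delta)

def weibull_survival(s, lam, delta):
    """Probability of surviving more than s years beyond age 65."""
    return np.exp(-(np.asarray(s, dtype=float) * lam)**delta)

def weibull_median(lam, delta):
    """Time s at which the survival function equals one half."""
    return np.log(2.0)**(1.0 / delta) / lam

# Hypothetical parameter values, not estimates from the chapter.
lam, delta = 0.04, 4.0
median_excess = weibull_median(lam, delta)

# Trapezoidal check that the density integrates to (approximately) one.
grid = np.linspace(0.0, 100.0, 200001)
dens = weibull_pdf(grid, lam, delta)
total_mass = float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))
```

The median follows from solving $(s\lambda)^{\delta} = \log 2$ for $s$.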


Our objective is to understand the survival patterns and their effects on the trajectories to dementia.
Following Manrique-Vallier (2010), we make the following assumptions: 1) the canonical
profiles specify both trajectories to dementia and mortality distributions; and 2) given the membership
vector $g_i$, the survival time $s$ and the Diagnosis $Y$ are independent. Therefore, the joint model
for dementia and mortality can be written as:

$$p(y_i, s_i \mid g_i, \mathrm{Age}_i) = \left[ \prod_{t=1}^{T} \sum_{k=1}^{K} g_{ik} f_{\theta_k}(y_{it} \mid \mathrm{Age}_{it}) \right] \left[ \sum_{k=1}^{K} g_{ik}\, w(s_i \mid \lambda_k, \delta_k) \right],$$

where the first factor defines the trajectories for MCI and dementia, as described in the previous
sections, and the second factor models the individual mortality patterns using the same number $K$
of extreme profiles and membership vector $g_i$.
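
The two-factor structure of the joint model can be sketched as follows for a single individual. This is an illustration under stated assumptions, not the authors' implementation: the per-profile outcome probabilities $f_{\theta_k}(y_{it} \mid \mathrm{Age}_{it})$ are taken as a precomputed array `outcome_probs`, and all numerical values and names are hypothetical.

```python
import numpy as np

def joint_likelihood(g, outcome_probs, s, lam, delta):
    """Joint likelihood of diagnoses and survival time for one individual.

    g:             membership vector, shape (K,), nonnegative and summing to one.
    outcome_probs: shape (T, K); entry [t, k] is f_{theta_k}(y_it | Age_it),
                   assumed to be computed elsewhere from the trajectory model.
    s:             survival time in excess of 65 years.
    lam, delta:    Weibull inverse scales and shapes, each of shape (K,).
    """
    g = np.asarray(g, dtype=float)
    lam = np.asarray(lam, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # First factor: product over occasions of the mixture over profiles.
    trajectory_factor = np.prod(np.asarray(outcome_probs, dtype=float) @ g)
    # Second factor: mixture of Weibull densities with the same weights.
    weibull = delta * lam**delta * s**(delta - 1.0) * np.exp(-(s * lam)**delta)
    survival_factor = weibull @ g
    return float(trajectory_factor * survival_factor)

# Hypothetical inputs for K = 3 profiles and T = 2 measurement occasions.
outcome_probs = np.array([[0.8, 0.5, 0.6],
                          [0.7, 0.4, 0.5]])
lam = np.array([0.04, 0.05, 0.045])
delta = np.array([4.0, 3.5, 4.0])
lik_mixed = joint_likelihood([0.5, 0.3, 0.2], outcome_probs, 20.0, lam, delta)
lik_pure = joint_likelihood([1.0, 0.0, 0.0], outcome_probs, 20.0, lam, delta)
```

When the membership vector puts all its weight on one profile, the expression reduces to that profile's outcome probabilities multiplied by its Weibull density.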


10.2.4 Full Bayesian Specification 

We complete the Bayesian specification of the model by specifying uninformative priors for the
trajectory parameters $\beta_{0k}$, $\beta_{1k}$, $c_k$ and the parameters $\alpha_{jk}$ of the membership distribution $G_\alpha$:

$$\beta_{0k}, \beta_{1k} \sim N(0, 100) \quad \text{for } k = 1, 2, \dots, K,$$

$$c_k \sim N(0, 100) \quad \text{for } k = 1, 2, \dots, K,$$

$$\alpha_{jk} \sim N(0, 100) \quad \text{for } j = 0, 1, \dots, M \text{ and } k = 1, 2, \dots, K.$$

We also specify the following priors for the parameters of the Weibull distribution used to model
the survival outcomes,

$$\delta_k \sim \mathrm{Gamma}(1, 1) \quad \text{for } k = 1, 2, \dots, K,$$

$$\lambda_k \sim \mathrm{Gamma}(1, 0.1) \quad \text{for } k = 1, 2, \dots, K,$$

which are considered diffuse, but realistic to model human survival times in excess of 65 years.


10.3 Application to the CHS Data: Results

We fit the model described in the previous section to the CHS data using BUGS, a software package 
for Bayesian inference using Gibbs sampling (Lunn et al., 2009). The interested reader is referred 
to Manrique-Vallier (2010) for more details on an MCMC algorithm used in a similar setting. We 
report here primarily on a model with $K = 3$ canonical profiles (we discuss the selection of the
number of profiles in Section 10.3.4 below). As described in Section 10.1.1, we consider
the single outcome variable diagnoses (three levels: dementia, MCI, normal), the time-varying
predictor Age, and four binary time-invariant predictors: $X_1$ = Race (White), $X_2$ = Education
(Beyond High School), $X_3$ = Hypertension (Present), and $X_4$ = APOE*4 (Present).


10.3.1 The Trajectories Toward MCI and Dementia 

Figure 10.1 shows the trajectories of the three canonical profiles, determined by the parameters, 
whose estimated posterior means and standard deviations are shown in Table 10.1. The probability 
of dementia as a function of age is shown in the left-hand panel, and the probability of MCI is 
shown in the right-hand panel. The bands around the three profiles are pointwise posterior 95% 
credible bands and describe the uncertainty related to the estimation of these trajectories. They are 
constructed using the MCMC draws of the parameters $\beta_{0k}$, $\beta_{1k}$, and $c_k$.


[Figure 10.1: left panel, “Prob of Dementia for 3 extreme profiles”; right panel, “Prob of MCI for 3 extreme profiles”.]

FIGURE 10.1

K = 3 typical trajectories for dementia and MCI with pointwise posterior 95% credible bands.

Profile 1 (continuous green curve), the ‘healthy’ profile, shows the typical or canonical trajectory 
of individuals whose peak probability of transitioning to MCI occurs between 95 and 100 years of 
age. This group has only a 50% probability of progressing to dementia by age 100. Profile 2 (dotted 
red curve), or ‘unhealthy’ profile, shows the typical or canonical trajectory of individuals who have 
a peak probability of progressing to MCI between the ages of 75 and 80, and a peak probability 
of progressing to dementia between the ages of 80 and 85. Finally, Profile 3 (dotted black curve), 
the ‘intermediate’ profile, shows the typical or canonical trajectory of individuals having a peak 
probability of progressing to MCI between 85 and 90 years of age, with a peak probability of 
progressing to dementia between 90 and 95 years. Figure 10.2 shows two individual trajectories as 
convex combinations of the canonical profiles as described by Equation (10.1). The trajectory closer 
to the ‘unhealthy’ profile belongs to an individual with the following characteristics: non-white, less
educated, hypertensive, ApoE4 present. The trajectory closer to the ‘healthy’ profile belongs to an
individual with the opposite characteristics: white, education beyond high school, non-hypertensive,
no-ApoE4.


Extreme trajectory    k=1 (healthy)     k=2 (unhealthy)   k=3 (intermediate)
parameter             Estimate (sd)     Estimate (sd)     Estimate (sd)

$\beta_{0k}$          38.011 (0.521)    38.904 (0.172)    47.913 (0.161)
$\beta_{1k}$          -0.388 (0.054)    -0.483 (0.027)    -0.531 (0.031)
$c_k$                  1.799 (0.334)     1.647 (0.135)     4.197 (0.307)

TABLE 10.1

Posterior means and standard deviations for the parameters defining the three typical trajectories for
dementia and MCI.



FIGURE 10.2 

Two individual trajectories as weighted combinations of the three typical profiles. The trajectory 
closer to the ‘unhealthy’ profile belongs to an individual with the following characteristics: non- 
white, less educated, hypertensive, ApoE4 present. The trajectory closer to the ‘healthy’ profile 
belongs to an individual with the opposite characteristics: white, education beyond high school, 
non-hypertensive, no-ApoE4. 


10.3.2 The Effect of Additional Covariates on the Membership Vectors 

In order to understand the effects of the four time-invariant covariates on the closeness of an in- 
dividual to each of the three canonical trajectories, it is necessary to examine the results in Table 
10.2. The first three rows of the table show the effect of race on trajectory membership, and we 
see that for the comparison of Profiles 1 and 2, having race coded as white results in an increased 
probability of being near the healthy profile relative to being near the unhealthy profile (i.e., mean 
$\alpha_{11} - \alpha_{12} = 1.21$ with posterior 95% credible interval [0.82, 1.62]). In addition, race significantly
increases the probability of being in the healthy profile relative to the intermediate profile. With 
regard to education, having more than a high school education resulted in increased closeness to the 
healthy profile relative to the unhealthy profile. However, education has no impact on the relative 




Handbook of Mixed Membership Models and Its Applications 


closeness of the intermediate profile to either the healthy or unhealthy profiles. Hypertension is as- 
sociated with greater closeness to the unhealthy profile relative to the intermediate profile, while the 
presence of even a single copy of the APOE*4 allele increases the closeness of individuals to the 
unhealthy profile. 


Effect of       Comparison        Parameter                      Estimate [95% CI]

Race            Profile 1 vs 2    $\alpha_{11} - \alpha_{12}$     1.21 [0.82, 1.62]
                Profile 1 vs 3    $\alpha_{11} - \alpha_{13}$     0.39 [-0.15, 0.84]
                Profile 2 vs 3    $\alpha_{12} - \alpha_{13}$    -0.83 [-1.26, -0.47]

Education       Profile 1 vs 2    $\alpha_{21} - \alpha_{22}$     0.50 [0.10, 0.92]
                Profile 1 vs 3    $\alpha_{21} - \alpha_{23}$     0.26 [-0.15, 0.81]
                Profile 2 vs 3    $\alpha_{22} - \alpha_{23}$    -0.24 [-0.66, 0.17]

Hypertension    Profile 1 vs 2    $\alpha_{31} - \alpha_{32}$    -0.26 [-0.60, 0.07]
                Profile 1 vs 3    $\alpha_{31} - \alpha_{33}$     0.18 [-0.22, 0.62]
                Profile 2 vs 3    $\alpha_{32} - \alpha_{33}$     0.43 [0.13, 0.79]

ApoE4           Profile 1 vs 2    $\alpha_{41} - \alpha_{42}$    -0.71 [-1.12, -0.26]
                Profile 1 vs 3    $\alpha_{41} - \alpha_{43}$     0.12 [-0.31, 0.60]
                Profile 2 vs 3    $\alpha_{42} - \alpha_{43}$     0.83 [0.40, 1.23]


TABLE 10.2

Posterior means and 95% credible intervals for the parameters representing the effects of time-invariant
predictors on the closeness of individual trajectories to the typical profiles.


10.3.3 The Survival Trajectories 

We also estimated survival trajectories, shown in Figure 10.3, based on the results of Table 10.3. For 
Profiles 1 and 3 the survival curves are almost overlapping, indicating that for individuals close to 
these profiles, the probability of being alive is below 50% only after the age of 90. By contrast for 
individuals that are close to the unhealthy profile, the probability of being alive is below 50% before 
the age of 90. The difference in the age for a 50% probability of survival is approximately 5 years 
between the unhealthy profile, and the healthy and intermediate profiles. By contrast, the difference 
in the age at which the different profiles reach a 50% probability of dementia is approximately 10 
years between each trajectory, and at least 20 years between the unhealthy and healthy profiles (See 
Figure 10.1). 


Weibull         k=1 (healthy)    k=2 (unhealthy)   k=3 (intermediate)
parameter       Estimate (sd)    Estimate (sd)     Estimate (sd)

$\lambda_k$     3.887 (0.376)    4.050 (0.256)     5.397 (0.616)
$\delta_k$      0.034 (0.001)    0.040 (0.001)     0.034 (0.001)


TABLE 10.3

Posterior means and standard deviations for the parameters defining the three typical survival
trajectories.


[Figure 10.3: “Survival trajectories for 3 extreme profiles”.]

FIGURE 10.3

K = 3 typical survival trajectories with pointwise posterior 95% credible bands.


10.3.4 Discussion: The Number of Typical Profiles

Finally, in order to evaluate our decision to include only three canonical trajectories in our model,
we used the method of posterior predictive testing (Gelman, 2007) to compare the models with
K = 3 and K = 2 canonical profiles; we found that there were systematic differences between the
model with K = 2 canonical profiles and the data. Using the estimated posterior distribution of the
parameters, we replicated the original diagnoses, obtaining 1000 different simulated datasets. The 
model with two canonical profiles systematically overestimates the number of individuals that are 
diagnosed with MCI at least once in their life. The histograms in Figure 10.4 show this test statistic 
for 1000 simulated datasets for the model with K = 2 canonical profiles and the model with K = 3 
canonical profiles. The vertical bars indicate the true value of the test statistic: 338 individuals have 
been diagnosed with MCI at least once. Then we compared the original and simulated diagnoses 
using the proportions of individuals affected by MCI at every age. In Figure 10.5 the black lines 
represent the true proportion of individuals affected by MCI between the ages of 71 and 105, and 
the red lines represent the same proportions for 30 simulated datasets. There are some discrepancies 
between the true proportions and those that were replicated using the model with K = 2 canonical 
profiles, while the proportions simulated through the model with K = 3 canonical profiles show no 
apparent discrepancies. We also attempted a model with four canonical profiles, but the estimation 
process was very slow to converge, and produced a fourth additional canonical profile that essen- 
tially duplicated the healthier one. Based on these results we conclude that the model with K = 3 
canonical profiles best fits the data. 
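
The first posterior predictive check described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a hypothetical integer coding of the diagnoses (0 = normal, 1 = MCI, 2 = dementia); the toy arrays stand in for the observed data and for datasets replicated from posterior draws, and are not the CHS data.

```python
import numpy as np

MCI = 1  # hypothetical coding: 0 = normal, 1 = MCI, 2 = dementia

def ever_mci_count(diagnoses):
    """Test statistic: number of individuals (rows) with at least one MCI diagnosis."""
    return int(np.any(np.asarray(diagnoses) == MCI, axis=1).sum())

def posterior_predictive_check(observed, replicated):
    """Compare the observed statistic with its distribution over replicated datasets.

    Returns the observed value, the replicated values, and the proportion of
    replicated values that are at least as large as the observed one.
    """
    t_obs = ever_mci_count(observed)
    t_rep = np.array([ever_mci_count(rep) for rep in replicated])
    return t_obs, t_rep, float(np.mean(t_rep >= t_obs))

# Toy data: 4 individuals observed on 3 occasions.
observed = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2], [0, 0, 2]])
replicated = [
    np.array([[1, 1, 2], [0, 1, 0], [1, 2, 2], [0, 0, 2]]),
    np.array([[0, 0, 2], [0, 1, 1], [1, 2, 2], [0, 0, 0]]),
    np.array([[0, 0, 0], [0, 0, 0], [0, 1, 2], [0, 2, 2]]),
]
t_obs, t_rep, p_value = posterior_predictive_check(observed, replicated)
```

A proportion close to 0 or 1 indicates that the replicated datasets systematically miss the observed statistic, which is the pattern reported above for the model with two profiles.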


[Figure 10.4: two histogram panels, “K=2 - Number of individuals diagnosed with MCI in 1000 replicated datasets” and “K=3 - Number of individuals diagnosed with MCI in 1000 replicated datasets”.]

FIGURE 10.4

Posterior predictive check. Number of individuals diagnosed with MCI.


FIGURE 10.5

Posterior predictive check. Proportions of individuals affected by MCI for 30 simulated datasets
(thinner red lines) and true proportions (thicker black lines).


10.4 Conclusion and Remarks

We reported here the results of an MMTM analysis of the natural history of the development of
dementia among individuals over the age of 65. We investigated the relative merits of three separate
trajectories, and then identified the effects of four time-invariant covariates on the nearness of
individuals to each of these profiles. The results provide new insights into the natural history of AD and
related dementias, and may also provide evidence for a potential difference in the pathophysiology
of the development of dementia as a function of age.

One of the important characteristics of MMTMs is that individual subjects are assumed to have 
weighted membership in each of the three canonical trajectories. Thus, while it is theoretically 
possible for an individual to be an ideal or perfect member of one trajectory, in fact, as shown in 
Figure 10.2, individuals actually share characteristics of all three profiles to varying degrees. The 
main extension of the TGoM presented in this chapter involves a particular dependency between 
the distribution of the membership vectors and the time-invariant predictors added in the model. 
Particular values of the new covariates help to explain the closeness of individuals to one or another 
of the ideal trajectories. 

Our decision to include three canonical trajectories in our model was based on three separate
factors: MCMC convergence time and cost, the model fit, and the interpretability of the trajectories.
The three-profile model not only provided us with a good cost-benefit ratio in terms
of processing time and model fit (as assessed by posterior predictive model checking), but also
provided interpretable trajectories: an unhealthy trajectory proceeding very rapidly through MCI
to dementia, a slow trajectory that does not become apparent until after the age of 90, and an
intermediate trajectory through MCI to dementia with a peak probability of a clinical syndrome
in the late 80s. We can view the results of the analysis of survivorship as a kind of validation of the
three-profile model. Thus, the fact that the individuals with an ‘unhealthy’ trajectory are also the
ones most likely to die sooner is consistent with the observation that demented individuals have a
higher risk of death (Bowen et al., 1996; Molsa et al., 1995).

If the use of MMTMs were extended to larger databases with appropriate follow-up and assessment
schedules, we might be able to evaluate the relative contributions of other genetic factors,
treatment history, and biomarkers to the natural history of dementia.


We present stochastic variational inference algorithms for two Bayesian nonnegative matrix
factorization (NMF) models. These algorithms allow for fast processing of massive datasets. In particular,
we derive stochastic algorithms for a Bayesian extension of the NMF algorithm of Lee and Se- 
ung (2001), and a matrix factorization model called correlated NMF, which is motivated by the 
correlated topic model (Blei and Lafferty, 2007). We apply our algorithms to roughly 1.8 million 
documents from the New York Times, comparing with online LDA (Hoffman et al., 2010b). 


11.1 Introduction 

In the era of “big data,” a significant challenge for machine learning research lies in developing
efficient algorithms for processing massive datasets (Jordan, 2011). In several modern data-modeling
environments, algorithms for mixed membership and other hierarchical Bayesian models no longer
have the luxury of waiting for Markov chain Monte Carlo (MCMC) samplers to perform the tens
of thousands of iterations necessary to approximately sample from the posterior, especially when
per-iteration runtime is long in the presence of much data. Instead, stochastic optimization methods
provide another non-Bayesian learning framework that is better suited to big data environments
(Bottou, 1998).

This may seem unfortunate for Bayesian methods in machine learning; however, recent advances
have combined stochastic optimization with hierarchical Bayesian modeling (Sato, 2001; Hoffman
et al., 2010b; Wang et al., 2011), allowing for approximate posterior inference for “big data.” Called
stochastic variational Bayes, this method performs stochastic optimization on the objective function
used in mean field variational Bayesian (VB) inference (Jordan et al., 1999; Sato, 2001; Hoffman
et al., 2013). Like maximum likelihood (ML) and maximum a posteriori (MAP) inference methods,
VB inference learns a point estimate that locally maximizes its objective function. But unlike ML
and MAP, which learn point estimates of a model’s parameters, VB learns a point estimate
on a set of probability distributions on these parameters.

Since maximizing the variational objective function minimizes the Kullback-Leibler divergence 
between the approximate posterior distribution and the true posterior (Jordan et al., 1999), varia- 
tional Bayes is an approximate Bayesian inference method. Because it is an optimization algorithm, 
it can leverage stochastic optimization techniques (Sato, 2001). This has recently proven useful in 
mixed membership topic modeling (Hoffman et al., 2010b; Wang et al., 2011), where the number of
documents constituting the data can be in the millions. However, the stochastic variational technique 
is a general method that can address big data issues for other model families as well. 
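
The equivalence between maximizing the variational objective and minimizing the Kullback-Leibler divergence follows from a standard identity, written out here as a worked equation. The symbols are generic and assumed for illustration only: $x$ denotes the data, $\eta$ all hidden quantities, and $q$ the approximating distribution.

```latex
\begin{align*}
\log p(x)
  &= \underbrace{\mathbb{E}_{q}\!\left[\log p(x, \eta)\right]
     - \mathbb{E}_{q}\!\left[\log q(\eta)\right]}_{\mathcal{L}(q)}
   \;+\; \mathrm{KL}\!\left(q(\eta) \,\|\, p(\eta \mid x)\right).
\end{align*}
```

Because $\log p(x)$ does not depend on $q$, any increase in $\mathcal{L}(q)$ is matched by an equal decrease in the divergence between $q$ and the true posterior.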

In this paper, we develop stochastic variational inference algorithms for two nonnegative matrix 
factorization models, which we apply to text modeling. Integrating out the latent indicators of a 
probabilistic topic model results in a nonnegative matrix factorization problem, and thus the rela- 
tionship to mixed membership models is clear. The first model we consider is a Bayesian extension 
of the well-known NMF algorithm of Lee and Seung (2001) with a KL penalty that has an equivalent 
maximum likelihood representation. This extension was proposed by Cemgil (2009), who derived a 
variational inference algorithm. We present a stochastic inference algorithm for this model, which 
significantly increases the amount of data that can be processed in a given period of time. 

The second model we consider is motivated by the correlated topic model (CTM) of Blei and 
Lafferty (2007). We first present a new representation of the CTM that represents topics and documents
as having latent locations in $\mathbb{R}^m$. In this formulation, the probability of any topic is a function
of the dot-product between the document and topic locations, which introduces correlations among 
the topic probabilities. The latent locations of the documents have additional uses, which we show 
with a document retrieval example. We carry this idea into the nonnegative matrix factorization 
domain and present a stochastic variational inference algorithm for this model as well. 

We apply our algorithms to 1.8 million documents from the New York Times. Processing this 
data in the traditional batch inference approach would be extremely expensive computationally 
since parameters for each document would need to be optimized before global parameters could 
be updated; MCMC methods are even less feasible. Using stochastic optimization, we show how 
stochastic VB can quickly learn the approximate posterior of these nonnegative matrix factorization 
models. Before deriving these inference algorithms, we give a general review of the stochastic VB 
approach. 

We organize the chapter as follows: In Section 11.2 we review the latent indicator approach to
probabilistic topic modeling, which forms the jumping-off point for the matrix factorization models
we consider. In Section 11.3 we review the Bayesian extension to NMF and present an alternate
mixture representation of this model that highlights the relationship to existing models (Blei et al.,
2003; Teh et al., 2007). In that section we also present correlated NMF, a matrix factorization model
with objectives similar to those of the CTM. In Section 11.4 we review mean field variational inference in
both its batch and stochastic forms. In Section 11.5 we present the stochastic inference algorithm
for Bayesian NMF and correlated NMF. In Section 11.6 we apply the algorithm to 1.8 million
documents from the New York Times.


11.2 Background: Probabilistic Topic Models

Probabilistic topic models assume a probabilistic generative structure for a corpus of text docu- 
ments. They are an effective method for uncovering the salient themes within a corpus, which can 
help the user navigate large collections of text. Topic models have also been applied to a wide vari- 
ety of data modeling problems, including those in image processing (Fei Fei and Perona, 2005) and 
political science (Grimmer, 2010), and are not restricted to document modeling applications,
though modeling text will be the focus of this chapter. 

A probabilistic topic model assumes the existence of an underlying collection of “topics,” each 
topic being a distribution on words in a vocabulary, as well as a distribution on these topics for 
each document. For a $K$-topic model, we denote the set of topics as $\beta_k \in \Delta_V$, where $\beta_{kv}$ is the
probability of word index $v$ given that a word comes from topic $k$. For document $d$, we denote the
distribution on these $K$ topics as $\theta_d \in \Delta_K$, where $\theta_{dk}$ is the probability that a word in document $d$
comes from topic $k$.

For a corpus of $D$ documents generated from a vocabulary of $V$ words, let $w_{dn} \in \{1, \dots, V\}$
denote the $n$th word in document $d$. In its most basic form, a latent-variable probabilistic topic
model assumes the following hierarchical structure for generating this word,

$$w_{dn} \sim \mathrm{Discrete}(\beta_{z_{dn}}), \qquad z_{dn} \overset{iid}{\sim} \mathrm{Discrete}(\theta_d). \tag{11.1}$$

The discrete distribution indicates that $\Pr(z_{dn} = i \mid \theta_d) = \theta_{di}$.

Therefore, to populate a document with words, one first selects the topic, or theme of each 
word, followed by the word-value itself using the distribution indexed by its topic. In this chapter, 
we work within the “bag-of-words” context, which assumes that the $N_d$ words within document
d are exchangeable; that is, the order of words in the document does not matter according to the 
model. We next review two bag-of-words probabilistic topic models. 
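
The two-stage structure of Equation (11.1) can be sketched as follows, assuming NumPy. The topic matrix `beta`, the topic distribution `theta_d`, and the function name are hypothetical, and word and topic indices are zero-based here for convenience.

```python
import numpy as np

def sample_document(beta, theta_d, n_words, rng):
    """Draw topic indicators and words for one document as in Equation (11.1).

    beta:    array of shape (K, V); row k is the word distribution of topic k.
    theta_d: array of shape (K,); the document's distribution on topics.
    """
    K, V = beta.shape
    # First select the topic of each word ...
    z = rng.choice(K, size=n_words, p=theta_d)
    # ... then draw each word from the distribution indexed by its topic.
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    return z, w

rng = np.random.default_rng(0)
# Hypothetical topics over a vocabulary of V = 4 words.
beta = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.0, 0.1, 0.2, 0.7]])
theta_d = np.array([0.9, 0.1])
z, w = sample_document(beta, theta_d, n_words=50, rng=rng)
```

Because the words are drawn independently given their topics, permuting `w` leaves its probability unchanged, which is the exchangeability assumed by the bag-of-words setting.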

Latent Dirichlet Allocation. A Bayesian topic model places prior distributions on $\beta_k$ and $\theta_d$. The
canonical example of a Bayesian topic model is latent Dirichlet allocation (LDA) (Blei et al., 2003),
which places Dirichlet distribution priors on these vectors,

$$\beta_k \overset{iid}{\sim} \mathrm{Dirichlet}(c_0 \mathbf{1}_V / V), \qquad \theta_d \overset{iid}{\sim} \mathrm{Dirichlet}(a_0 \mathbf{1}_K). \tag{11.2}$$

The vector $\mathbf{1}_a$ is an $a$-dimensional vector of ones. LDA is an example of a conjugate exponential
family model; all conditional posterior distributions are closed-form and in the same distribution 
family as the prior. This gives LDA a significant algorithmic advantage. 

Correlated Topic Models. One potential drawback of LDA is that the Dirichlet prior on $\theta_d$ does not
model correlations between topic probabilities. This runs counter to a priori intuition, which says
that some topics are more likely to co-occur than others (e.g., topics on “politics” and “military”
versus a topic on “cooking”). A correlated topic model (CTM) was proposed (Blei and Lafferty,
2007) to address this issue. This model replaces the Dirichlet distribution prior on $\theta_d$ with a logistic
normal distribution prior (Aitchison, 1982),

$$\theta_{dk} = \frac{\exp\{y_{dk}\}}{\sum_{j=1}^{K} \exp\{y_{dj}\}}, \qquad y_d \sim \mathrm{Normal}(0, \Sigma). \tag{11.3}$$

The covariance matrix $\Sigma$ contains the correlation information for the topic probabilities. To allow
for this correlation structure to be determined by the data, the covariance matrix $\Sigma$ has a conjugate
inverse Wishart prior,

$$\Sigma \sim \mathrm{invWishart}(A, m). \tag{11.4}$$

The correlated topic model can therefore “anticipate” co-occurring themes better than LDA, but since
the logistic normal distribution is not conjugate to the multinomial, inference is not as straightforward.
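
The logistic normal construction in Equation (11.3) can be sketched as follows, assuming NumPy. The covariance matrix below is hypothetical and chosen so that the first two topics are positively correlated; the function name is an illustration only.

```python
import numpy as np

def sample_topic_proportions(Sigma, n_docs, rng):
    """Draw theta_d for n_docs documents from a logistic normal prior."""
    K = Sigma.shape[0]
    y = rng.multivariate_normal(np.zeros(K), Sigma, size=n_docs)
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    y = y - y.max(axis=1, keepdims=True)
    expy = np.exp(y)
    return expy / expy.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical covariance: topics 0 and 1 tend to rise and fall together.
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
theta = sample_topic_proportions(Sigma, n_docs=5000, rng=rng)

# Log ratios relative to topic 2 recover differences of the Gaussian coordinates,
# so their sample correlation reflects the structure of Sigma.
r0 = np.log(theta[:, 0] / theta[:, 2])
r1 = np.log(theta[:, 1] / theta[:, 2])
corr = float(np.corrcoef(r0, r1)[0, 1])
```

A Dirichlet prior cannot express this kind of positive dependence between two topic proportions, which is the motivation given above for the CTM.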

Matrix Factorization Representations. As mentioned, these hierarchical Bayesian priors are pre- 
sented within the context of latent indicator topic models. The distinguishing characteristic of this 
framework is the hidden data $z_{dn}$, which indicates the topic of word $n$ in document $d$. Marginalizing
out these random variables, one enters the domain of nonnegative matrix factorization (Lee and
Seung, 2001; Gaussier and Goutte, 2005; Singh and Gordon, 2008). In this modeling framework,
the data is restructured into a matrix of nonnegative integers, $X \in \mathbb{N}^{V \times D}$. The entry $X_{vd}$ is a count
of the number of times word $v$ appears in document $d$. Therefore,

$$X_{vd} = \sum_{n=1}^{N_d} \mathbb{1}(w_{dn} = v). \tag{11.5}$$

Typically, most values of X can be expected to equal zero. Several matrix factorization approaches 
exist for modeling this representation of the data. In the next section, we discuss two NMF models 
for this data matrix. 
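
Equation (11.5) amounts to counting word occurrences per document. The sketch below builds the count matrix from lists of word indices, assuming NumPy and zero-based indices; the toy corpus and the function name are hypothetical.

```python
import numpy as np

def count_matrix(docs, V):
    """Build the V x D matrix X with X[v, d] = number of times word v occurs in document d.

    docs: list of sequences of integer word indices in 0..V-1.
    """
    X = np.zeros((V, len(docs)), dtype=np.int64)
    for d, words in enumerate(docs):
        X[:, d] = np.bincount(np.asarray(words, dtype=np.int64), minlength=V)
    return X

# Toy corpus of D = 3 documents over a vocabulary of V = 4 words.
docs = [[0, 2, 2, 3], [1, 1, 1], []]
X = count_matrix(docs, V=4)
```

Each column of `X` sums to the length $N_d$ of the corresponding document, and a dense array is used here only for clarity; for a realistic vocabulary a sparse format would be the natural choice, since most entries are zero.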


11.3 Two Parametric Models for Bayesian Nonnegative Matrix Factorization

As introduced in the previous section, our goal is to factorize a $V \times D$ data matrix $X$ of nonnegative
integers. This matrix arises by integrating out the latent topic indicators associated with each word
in a probabilistic topic model, thus turning a latent indicator model into a nonnegative matrix
factorization model. The matrix to be factorized is not $X$, but an underlying matrix of nonnegative latent
variables $\Lambda \in \mathbb{R}_+^{V \times D}$. Each entry of this latent matrix is associated with a corresponding entry in
$X$, and we assume a Poisson data-generating distribution, with $X_{vd} \sim \mathrm{Poisson}(\Lambda_{vd})$.

A frequently used model for $X$ is simply called NMF, and was presented by Lee and Seung
(1999). This model assumes $\Lambda$ to be low-rank, the rank $K$ being chosen by the modeler, and
factorized into the matrix product $\Lambda = B\Theta$, with $B \in \mathbb{R}_+^{V \times K}$ and $\Theta \in \mathbb{R}_+^{K \times D}$. Lee and Seung
(2001) presented optimization algorithms for two penalty functions; in this chapter we focus on the
Kullback-Leibler (KL) penalty. This KL penalty has a probabilistic interpretation, since it results in
an optimization program for NMF that is equivalent to a maximum likelihood approximation of the
Poisson generating model,

$$\{B^*, \Theta^*\} = \arg\max_{B, \Theta} P(X \mid B, \Theta) = \arg\max_{B, \Theta} \prod_{v,d} \mathrm{Poisson}(X_{vd} \mid (B\Theta)_{vd}).$$

A major attraction of the NMF algorithm is the fast multiplicative update rule for learning $B$ and $\Theta$
(Lee and Seung, 2001). We next review the Bayesian extension of NMF (Cemgil, 2009). We then
present a correlated NMF model that takes its motivation from the latent-indicator correlated
topic model.
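
For reference, the multiplicative updates for the KL penalty can be sketched as follows. This is a minimal NumPy illustration of the standard Lee and Seung updates, not code from the chapter; the small constant `eps`, the initialization, and the toy data are assumptions added for numerical safety and demonstration.

```python
import numpy as np

def poisson_loglik(X, B, Theta, eps=1e-12):
    """Poisson log-likelihood of X given rates B @ Theta, up to a constant in X."""
    rate = B @ Theta + eps
    return float(np.sum(X * np.log(rate) - rate))

def nmf_kl(X, K, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative updates for NMF under the KL penalty."""
    rng = np.random.default_rng(seed)
    V, D = X.shape
    B = rng.uniform(0.5, 1.5, size=(V, K))
    Theta = rng.uniform(0.5, 1.5, size=(K, D))
    history = []
    for _ in range(n_iter):
        # Update Theta with B held fixed.
        ratio = X / (B @ Theta + eps)
        Theta *= (B.T @ ratio) / (B.sum(axis=0)[:, None] + eps)
        # Update B with Theta held fixed, using the refreshed reconstruction.
        ratio = X / (B @ Theta + eps)
        B *= (ratio @ Theta.T) / (Theta.sum(axis=1)[None, :] + eps)
        history.append(poisson_loglik(X, B, Theta, eps))
    return B, Theta, history

# Toy count matrix drawn from a hypothetical rank-3 Poisson model.
rng = np.random.default_rng(1)
X = rng.poisson(rng.uniform(0.5, 1.5, (20, 3)) @ rng.uniform(0.5, 1.5, (3, 30)))
B, Theta, history = nmf_kl(X, K=3, n_iter=50)
```

The updates keep both factors nonnegative without any projection step, and the recorded log-likelihood should not decrease from one iteration to the next, which reflects the equivalence with maximum likelihood noted above.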

11.3.1 Bayesian NMF 

The NMF model with KL penalty was recently extended to the Bayesian setting under the name 
Bayesian NMF (Cemgil, 2009). This extension places gamma priors on all elements of $B$ and $\Theta$.
The generative process of Bayesian NMF under our selected parameterization is

$$X_{vd} \sim \mathrm{Poisson}\Big(\sum_{k=1}^{K} \beta_{vk} \theta_{kd}\Big), \tag{11.6}$$

$$\beta_{vk} \sim \mathrm{Gamma}(c_0 / V, c_0), \qquad \theta_{kd} \sim \mathrm{Gamma}(a_0, b_0). \tag{11.7}$$

Note that $\sum_v \beta_{vk} \neq 1$ with probability 1. We also observe that this is not a matrix factorization
approach to LDA, though $\beta$ and $\theta$ serve similar functions and have a similar interpretation, as
discussed below. Therefore, it is still meaningful to refer to $\beta_{:k}$ as a “topic,” and we adopt this
convention below.
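
The generative process in Equations (11.6) and (11.7) can be simulated directly. The sketch below assumes NumPy and assumes that the second argument of each gamma distribution is a rate, so that it enters NumPy's sampler as the reciprocal scale; the dimensions and hyperparameter values are hypothetical.

```python
import numpy as np

def sample_bayesian_nmf(V, D, K, a0, b0, c0, rng):
    """Draw B, Theta, and X from the Bayesian NMF generative process."""
    # NumPy's gamma sampler takes (shape, scale); a rate r corresponds to scale 1 / r.
    B = rng.gamma(shape=c0 / V, scale=1.0 / c0, size=(V, K))
    Theta = rng.gamma(shape=a0, scale=1.0 / b0, size=(K, D))
    # Each count is Poisson with rate equal to the matching entry of B @ Theta.
    X = rng.poisson(B @ Theta)
    return B, Theta, X

rng = np.random.default_rng(0)
# Hypothetical sizes and hyperparameters for illustration only.
B, Theta, X = sample_bayesian_nmf(V=50, D=8, K=3, a0=1.0, b0=0.05, c0=5.0, rng=rng)
column_sums = B.sum(axis=0)
```

The column sums of `B` are random rather than fixed at one, in line with the remark that the columns of $B$ are not restricted to the probability simplex.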

Just as the latent-variable probabilistic topic models discussed in Section 11.2 have nonnegative
matrix factorization representations, the reverse direction holds for Bayesian NMF. The latent-variable
representation of Bayesian NMF is insightful since it shows a close relationship with LDA.
Using the data-generative structure given in Equation (11.1) (with an additional $'$ to distinguish from
Equation (11.7)), the latent topics and distributions on these topics have the following generative
process:

$$\beta'_k \overset{iid}{\sim} \mathrm{Dirichlet}(c_0 \mathbf{1}_V / V), \qquad \theta'_{dk} := \xi_{dk} \hat{\theta}_{dk}, \tag{11.8}$$

$$\xi_{dk} := \frac{e_k}{\sum_{j=1}^{K} \hat{\theta}_{dj} e_j}, \qquad \hat{\theta}_d \overset{iid}{\sim} \mathrm{Dirichlet}(a_0 \mathbf{1}_K), \qquad e_k \overset{iid}{\sim} \mathrm{Gamma}(c_0, c_0). \tag{11.9}$$

The vectors $\beta'_k$ and $\theta'_d$ correspond to the topics and document distributions on topics, respectively.
Note that $\sum_k \theta'_{dk} = 1$. We see that when $e_k = 1$, LDA is recovered.¹ Thus, when the columns of
$B$ are restricted to the probability simplex, that is, when $e_k = 1$ with probability 1 for each $k$, one
obtains the matrix factorization representation of LDA, also called GaP (Canny, 2004). Relaxing this
constraint to gamma distributed random variables allows for a computationally simpler variational
inference algorithm for the matrix factorization model, which we give in Section 11.5.1.

The representation in Equations (11.8) and (11.9) shows the motivation for parameterizing the
gamma distributions on $\beta$ as done in Equation (11.7). The desire is for $\xi$ to be close to 1, which
results in a model close to LDA. This parameterization gives a good approximation; since $c_0$ is
commonly set equal to a fraction of $V$ in LDA, for example $c_0 = 0.1V$, and because $V$ is often
on the order of thousands, the distribution of $e_k$ is highly peaked around 1, with $E[e_k] = 1$ and
$\mathrm{Var}(e_k) = 1/c_0$. Though this latent variable representation affords some insight into the relationship
between Bayesian NMF and LDA, we derive a cleaner inference algorithm using the hierarchical
structure in Equations (11.6) and (11.7).
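
The claim that the column sums concentrate around 1 can be checked by simulation. The sketch below assumes NumPy, takes $V = 1000$ with $c_0 = 0.1V$ as in the example above, and again treats the second gamma argument as a rate; the number of simulated columns is an arbitrary choice made for the illustration.

```python
import numpy as np

V = 1000
c0 = 0.1 * V       # the example setting mentioned in the text
n_columns = 2000   # hypothetical number of simulated topics

rng = np.random.default_rng(0)
# Each entry is Gamma(c0 / V, c0) with c0 as a rate, i.e. scale 1 / c0 in NumPy.
B = rng.gamma(shape=c0 / V, scale=1.0 / c0, size=(V, n_columns))

# Column sums: one draw of the total mass per simulated topic.
e = B.sum(axis=0)
mean_e = float(e.mean())
var_e = float(e.var())
```

With these settings the sample mean of the column sums should be close to 1 and the sample variance close to $1/c_0 = 0.01$, so a typical column of $B$ is nearly a probability vector.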

11.3.2 Correlated NMF 

We next propose a correlated NMF model, which we build on an alternate representation of the 
correlated topic model (CTM) (Blei and Lafferty, 2007). To derive the model, we first present the 
alternate representation of the CTM. Following a slight alteration to the prior on the covariance
matrix $\Sigma$, we show how we can “unpack the information” in the CTM to allow for a greater degree
of exploratory data analysis.

Recall that an inverse Wishart prior was placed on $\Sigma$, the covariance of the document-specific
lognormal vectors, in Equation (11.4). Instead, we propose a Wishart prior,

$$\Sigma \sim \mathrm{Wishart}(\sigma^2 I_K, m), \tag{11.10}$$

and assume a diagonal matrix parameter. Though this change appears minor, it allows for the prior to 
be expanded hierarchically in a way that allows the model parameters to contain more information 
that can aid in understanding the underlying dataset. 

There are two steps to unpacking the CTM. For the first step, we observe that one can 
sample E from its Wishart prior by first generating a matrix L £ K mx K , where each entry 

1 Th i s additional random variable e k arises out of the derivation by defining e k := 5” t /3 vk , with f3 vk drawn as in 

Equation (1 1.7). Nevertheless, e k can be shown to be independent of all other random variables. 
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Lij ^Normal(0, o 2 ), and then defining £ := L T L. It follows that £ has the desired Wishart distri- 
bution. 

Intuitively, with this expansion each topic now has a “location” $\ell_k$, the $k$th column of
$L$. That is, column $k$ of topic matrix $B$ now has an associated latent location $\ell_k \in \mathbb{R}^m$, where
$\ell_k \sim \mathrm{Normal}(0, \sigma^2 I_m)$. Note that when $m < K$, the covariance $\Sigma$ is not full rank. This provides
additional modeling flexibility to the CTM, which in previous manifestations required a full rank
estimate of $\Sigma$ (Blei and Lafferty, 2007).
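As an illustration of this first step, the following minimal Python sketch draws the location matrix and forms the implied covariance. It assumes NumPy; the function name and the toy sizes are hypothetical and chosen only to make the rank deficiency visible.

```python
import numpy as np

def sample_low_rank_covariance(K, m, sigma2, rng):
    """Draw L (m x K) with i.i.d. Normal(0, sigma2) entries; return (L, Sigma = L^T L)."""
    L = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(m, K))
    Sigma = L.T @ L  # K x K covariance, rank at most min(m, K)
    return L, Sigma

rng = np.random.default_rng(0)
K, m = 8, 3  # toy sizes with m < K
L, Sigma = sample_low_rank_covariance(K, m, sigma2=1.0 / m, rng=rng)
rank = np.linalg.matrix_rank(Sigma)  # numerical rank of the K x K matrix
```

With $m < K$ the numerical rank equals $m$, which is the rank deficiency noted above.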

The second step in unpacking the CTM is to define an alternate representation of $y_d \sim
\mathrm{Normal}(0, L^T L)$. We recall that this is the logistic normal vector that is passed to the softmax
function in order to obtain a distribution on topics for document $d$, as described in Equation (11.3).
We can again introduce Gaussian vectors, this time to construct $y_d$:

$$y_d := L^T u_d, \qquad u_d \sim \mathrm{Normal}(0, I_m). \tag{11.11}$$

The marginal distribution of $y_d$, or $p(y_d | L) = \int_{\mathbb{R}^m} p(y_d | L, u_d)\, p(u_d)\, du_d$, is a $\mathrm{Normal}(0, L^T L)$
distribution, as desired. To derive this marginal, first let $y_d | L, u_d, \epsilon \sim \mathrm{Normal}(L^T u_d, \epsilon I)$, next calculate
$p(y_d | L, \epsilon) = \mathrm{Normal}(0, \epsilon I + L^T L)$, and finally let $\epsilon \to 0$. As with the topic location $\ell_k$, the vector
$u_d$ also has an interpretation as a location for document $d$. These locations are useful for search
applications, as we show in Section 11.6.
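The marginal covariance can be checked numerically. The sketch below is a Monte Carlo sanity check under assumed toy dimensions, not part of the derivation: it draws many $u_d$, forms $y_d = L^T u_d$, and compares the empirical covariance with $L^T L$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, K, n_docs = 2, 4, 200_000           # hypothetical toy sizes
L = rng.normal(size=(m, K))            # fixed location matrix
U = rng.normal(size=(n_docs, m))       # rows are u_d ~ Normal(0, I_m)
Y = U @ L                              # rows are y_d^T, i.e. y_d = L^T u_d
empirical_cov = (Y.T @ Y) / n_docs     # y_d has zero mean, so no centering is needed
target_cov = L.T @ L
max_abs_error = np.max(np.abs(empirical_cov - target_cov))
```

The discrepancy shrinks at the usual Monte Carlo rate as the number of simulated documents grows.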

For the latent variable CTM, this results in a new hierarchical prior for topic distribution $\theta_d$. The
previous hierarchical prior of Equation (11.3) becomes the following,

$$\theta_{dk} = \frac{\exp\{\ell_k^T u_d\}}{\sum_{j} \exp\{\ell_j^T u_d\}}, \qquad \ell_k \sim \mathrm{Normal}(0, \sigma^2 I_m), \qquad u_d \sim \mathrm{Normal}(0, I_m). \tag{11.12}$$

Transferring this into the domain of nonnegative matrix factorization, we observe that the normalization
of the exponential is unnecessary. The reason is similar to that for the random variables
in Bayesian NMF, which made the transition from being Dirichlet distributed to gamma distributed.
We also include a bias term $\alpha_d$ for each document, which performs the scaling necessary to account
for document length.

The generative process for correlated NMF is similar to that of Bayesian NMF, with many distributions
being the same:

$$X_{vd} \sim \mathrm{Poisson}\Big(\sum_{k=1}^{K} \beta_{vk} \exp\{\alpha_d + \ell_k^T u_d\}\Big), \tag{11.13}$$

$$\beta_{vk} \sim \mathrm{Gamma}(c_0/V, c_0), \qquad \ell_k \sim \mathrm{Normal}(0, \sigma^2 I_m), \qquad u_d \sim \mathrm{Normal}(0, I_m).$$

The scaling performed by $\alpha_d$ allows the product $\ell_k^T u_d$ to model only random effects. We learn a
point estimate of this parameter.

The latent locations introduced to the CTM and this model require setting the latent space
dimension $m$. Since we are in effect modeling a rank-$m$ covariance matrix $\Sigma$ with these vectors, the
variety of correlations decreases as $m$ decreases, and the model becomes more restrictive in the distributions
on topics it can model. On the other hand, one should set $m \le K$, since for $m > K$ there are $m - K$
redundant dimensions.
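To make the generative process in Equation (11.13) concrete, the following Python sketch draws a synthetic count matrix. It assumes NumPy; the function name, the toy sizes, and the fixed bias values passed as `alpha` are hypothetical (in the model, $\alpha_d$ is learned as a point estimate rather than supplied).

```python
import numpy as np

def sample_correlated_nmf(V, D, K, m, c0, sigma2, alpha, rng):
    """Draw a V x D count matrix from the correlated NMF generative process (11.13)."""
    # NumPy's gamma takes (shape, scale); Gamma(c0/V, rate c0) has scale 1/c0.
    beta = rng.gamma(shape=c0 / V, scale=1.0 / c0, size=(V, K))
    L = rng.normal(scale=np.sqrt(sigma2), size=(m, K))   # columns are topic locations l_k
    U = rng.normal(size=(m, D))                          # columns are document locations u_d
    weights = np.exp(alpha[None, :] + L.T @ U)           # K x D: exp{alpha_d + l_k^T u_d}
    rate = beta @ weights                                # V x D Poisson rates
    X = rng.poisson(rate)
    return X, beta, L, U

rng = np.random.default_rng(0)
V, D, K, m = 50, 20, 5, 2                                # toy sizes
alpha = np.full(D, np.log(100.0))                        # assumed bias values
X, beta, L, U = sample_correlated_nmf(V, D, K, m, c0=0.05 * V, sigma2=1.0 / m,
                                      alpha=alpha, rng=rng)
```

Because each column of `beta` sums to roughly one in expectation, the bias term largely controls the expected document length in this sketch.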


11.4 Stochastic Variational Inference 

Text datasets can often be classified as a “big data” problem. For example, Wikipedia currently 
indexes several million entries, and the New York Times has published almost two million articles 
in the last 20 years. In other problem domains the amount of data is even larger. For example, a
hyperspectral image can contain a hundred million voxels in a single data cube. With so much data, 
fast inference algorithms are essential. Stochastic variational inference (Sato, 2001; Hoffman et al., 
2013) is a significant step in this direction for hierarchical Bayesian models. 

Stochastic variational inference exploits the difference between local variables, or those asso- 
ciated with a single unit of data, and global variables, which are shared among an entire dataset. 
In brief, stochastic VB works by splitting a large dataset into smaller groups. These smaller groups 
can be quickly processed, with each iteration processing a new group of data. In the context of 
probabilistic topic models, the unit of data is a document, and the global variables are the topics 
(among other possible variables), while the local variables are document-specific and relate to the 
distribution on these topics. 

Recent stochastic inference algorithms developed for LDA (Hoffman et al., 2010b), the HDP 
(Wang et al., 2011), and other models (e.g., in Paisley et al., 2012) have shown rapid speed-ups in 
inference for probabilistic topic models. Though mainly applied to latent-indicator topic models thus 
far, the underlying theory of stochastic VB is more general, and applies to other families of models. 
One goal of this chapter is to show how this inference method can be applied to nonnegative matrix 
factorization, placing the resulting algorithms in the family of online matrix factorization methods 
(Mairal et al., 2010). Specifically, we develop stochastic variational inference algorithms for the 
Bayesian NMF and correlated NMF models discussed in Section 11.3.

We next review the relevant aspects of variational inference that make deriving stochastic algo- 
rithms easy. Our approach is general, which will allow us to immediately derive the update rules for 
the stochastic VB algorithm for Bayesian NMF and correlated NMF. We focus on conjugate expo- 
nential models and present a simple derivation on a toy example — one for which online inference is 
not necessary, but which allows us to illustrate the idea.²

11.4.1 Mean Field Variational Bayes 

Mean field variational inference is an approximate Bayesian inference method (Jordan et al., 1999).
It approximates the full posterior of a set of model parameters $p(\Phi | X)$ with a factorized distribution
$Q(\Phi) = \prod_i q(\phi_i)$ by minimizing their Kullback-Leibler divergence. This is done by maximizing
the variational objective $\mathcal{L}$ with respect to the variational parameters $\Psi$ of $Q$. The objective function
is

$$\mathcal{L}(X, \Psi) = \mathbb{E}_Q[\ln p(X, \Phi)] + H[Q]. \tag{11.14}$$

When the prior and likelihood of all nodes of the model fall within the conjugate exponential
family, variational inference has a simple optimization procedure (Winn and Bishop, 2005). We illustrate
this with the following example, which we extend to the stochastic setting in Section 11.4.2.
This generic example gives the general form of the stochastic variational inference algorithm, which 
we later apply to Bayesian NMF and correlated NMF. 

Consider $D$ independent samples from an exponential family distribution $p(x | \eta)$, where $\eta$ is the
natural parameter vector. The data likelihood under this model has the standard form

$$p(x | \eta) = \Big[\prod_{d=1}^{D} h(x_d)\Big] \exp\Big\{\eta^T \sum_{d=1}^{D} t(x_d) - D A(\eta)\Big\}.$$

The sum of vectors $t(x_d)$ forms the sufficient statistic of the likelihood. The conjugate prior on $\eta$
has a similar form

$$p(\eta | \chi, \nu) = f(\chi, \nu) \exp\{\eta^T \chi - \nu A(\eta)\}, \tag{11.15}$$

and conjugacy motivates selecting a $q$ distribution in this same family,

$$q(\eta | \chi', \nu') = f(\chi', \nu') \exp\{\eta^T \chi' - \nu' A(\eta)\}. \tag{11.16}$$

² Although Bayesian NMF is not in fact fully conjugate, we will show that a bound introduced for tractable inference
modifies the joint likelihood such that the model effectively is conjugate. For correlated NMF, we will also make adjustments
for non-conjugacy.


After computing the variational lower bound given in Equation (11.14), which can be done explicitly
for this example, inference proceeds by taking gradients with respect to variational parameters, in
this case the vector $\psi := [\chi'^T, \nu']^T$, and then setting to zero to find their updated values. For
conjugate exponential family models, this gradient has the general form

$$\nabla_\psi \mathcal{L}(X, \Psi) = -\begin{bmatrix} \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \chi' \partial \chi'^T} & \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \chi' \partial \nu'} \\[2ex] \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \nu' \partial \chi'^T} & \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \nu'^2} \end{bmatrix} \begin{bmatrix} \chi + \sum_{d=1}^{D} t(x_d) - \chi' \\[1ex] \nu + D - \nu' \end{bmatrix}, \tag{11.17}$$

and can be explicitly derived from the lower bound. Setting this to zero, one can immediately
read off the variational parameter updates from the right vector, which in this case are $\chi' = \chi +
\sum_{d=1}^{D} t(x_d)$ and $\nu' = \nu + D$. Though the matrix in Equation (11.17) is often very complicated, it is
superfluous to batch variational inference for conjugate exponential family models. In the stochastic
optimization of Equation (11.14), however, this matrix cannot be similarly ignored.

We show a visual representation of batch variational inference for Bayesian matrix factorization 
in Figure 11.1. The above procedure repeats for each variational Q distribution; first for all distri- 
butions of the right matrix, followed by those of the left. We note that, if conjugacy does not hold, 
gradient ascent can be used to optimize $\psi$.
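For the toy conjugate model, the batch update amounts to adding sufficient statistics to the prior parameters. The sketch below assumes NumPy and uses a hypothetical Poisson example, for which $t(x) = x$ and the conjugate prior on the rate is a gamma distribution with shape $\chi$ and rate $\nu$; the data values and function name are invented for illustration.

```python
import numpy as np

def batch_conjugate_update(chi, nu, suff_stats):
    """Batch update read off from (11.17): chi' = chi + sum_d t(x_d), nu' = nu + D."""
    suff_stats = np.asarray(suff_stats, dtype=float)
    return chi + suff_stats.sum(axis=0), nu + suff_stats.shape[0]

# Hypothetical Poisson counts; the sufficient statistic of each sample is the count itself.
x = np.array([3, 0, 2, 5, 1])
chi_post, nu_post = batch_conjugate_update(chi=1.0, nu=1.0, suff_stats=x)
posterior_mean_rate = chi_post / nu_post  # mean of Gamma(shape=chi', rate=nu')
```

No gradient matrix appears in this code, which reflects the remark above that the matrix in Equation (11.17) can be ignored in the batch conjugate case.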
FIGURE 11.1

A graphic describing batch variational inference for Bayesian nonnegative matrix factorization. For 
each iteration, all variational parameters for document specific (local) variables are updated first. 
Using these updated values, the variational parameters for the global topics are updated. When 
there are many documents being modeled, i.e., when the number of columns is very large, step 1 in 
the image can have a long runtime. 


11.4.2 Stochastic Optimization of the Variational Objective 

Stochastic optimization of the variational lower bound involves forming a noisy gradient of $\mathcal{L}$ using
a random subset of the data at each iteration. Let $C_t \subset \{1, \dots, D\}$ index this subset at iteration
$t$. Also, let $\phi_d$ be the model variables associated with observation $x_d$ and $\Phi_X$ the variables
shared among all observations. In Table 11.1, we distinguish the local from the global variables for
Bayesian NMF and correlated NMF.




Model            Local variables                        Global variables

Bayesian NMF     $\{\theta_{kd}\}_{k=1:K,\, d=1:D}$     $\{\beta_{vk}\}_{v=1:V,\, k=1:K}$
Correlated NMF   $\{u_d, \alpha_d\}_{d=1:D}$            $\{\beta_{vk}, \ell_k\}_{v=1:V,\, k=1:K}$

TABLE 11.1 

Local and global variables for the two Bayesian nonnegative matrix factorization models considered 
in this chapter. Stochastic variational inference partitions the local variables into batches, with each 
iteration of inference processing one batch. Updates to the global variables follow each batch. 


The stochastic variational objective function $\mathcal{L}_s$ is the noisy version of $\mathcal{L}$ formed by selecting a
subset of the data,

$$\mathcal{L}_s(X_{C_t}, \Psi) = \frac{D}{|C_t|} \sum_{d \in C_t} \mathbb{E}_Q[\ln p(x_d, \phi_d | \Phi_X)] + \mathbb{E}_Q[\ln p(\Phi_X)] + H[Q]. \tag{11.18}$$

This constitutes the objective function at step $t$. By optimizing $\mathcal{L}_s$, we are optimizing $\mathcal{L}$ in expectation.
That is, since each subset $C_t$ is equally probable, with $p(C_t) = \binom{D}{|C_t|}^{-1}$, and since $d \in C_t$ for
$\binom{D-1}{|C_t|-1}$ of the $\binom{D}{|C_t|}$ possible subsets, it follows that

$$\mathbb{E}_{p(C_t)}[\mathcal{L}_s(X_{C_t}, \Psi)] = \mathcal{L}(X, \Psi). \tag{11.19}$$
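The counting argument behind Equation (11.19) can be verified by brute force on a small example. The sketch below uses only the Python standard library and enumerates every subset of a fixed size; the per-document values stand in for the terms $\mathbb{E}_Q[\ln p(x_d, \phi_d | \Phi_X)]$ and are hypothetical.

```python
from itertools import combinations
from math import comb, isclose

def expected_scaled_subset_sum(values, batch_size):
    """Average of (D/|C|) * sum_{d in C} values[d] over all equally likely subsets C."""
    D = len(values)
    total = 0.0
    for subset in combinations(range(D), batch_size):
        total += (D / batch_size) * sum(values[d] for d in subset)
    return total / comb(D, batch_size)

per_document_terms = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5]  # hypothetical values
full_sum = sum(per_document_terms)
subset_expectation = expected_scaled_subset_sum(per_document_terms, batch_size=2)
```

The scaled subset sum matches the full sum in expectation for any batch size, which is why the factor $D/|C_t|$ appears in Equation (11.18).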


Therefore, by optimizing $\mathcal{L}_s$ we are stochastically optimizing $\mathcal{L}$. Stochastic variational optimization
proceeds by optimizing the objective in Equation (11.18) with respect to $\psi_d$, $d \in C_t$, followed by
an update to $\Psi_X$ that blends the new information with the old. For example, in the simple conjugate
exponential model of Section 11.4.1, the update of the vector $\psi := [\chi'^T, \nu']^T$ at iteration $t$ follows
a gradient step,

$$\psi_t = \psi_{t-1} + \rho_t G \nabla_\psi \mathcal{L}_s(X_{C_t}, \Psi). \tag{11.20}$$

The matrix $G$ is a positive definite preconditioning matrix and $\rho_t$ is a step size satisfying $\sum_{t=1}^{\infty} \rho_t =
\infty$ and $\sum_{t=1}^{\infty} \rho_t^2 < \infty$, which ensures convergence (Bottou, 1998).

The key to stochastic variational inference for conjugate exponential models is in selecting $G$.
Since the gradient of $\mathcal{L}_s$ has the same form as Equation (11.17), the difference being a sum over
$d \in C_t$ rather than the entire dataset, $G$ can be set to the inverse of the matrix in (11.17) to allow for
cancellation. An interesting observation is that this matrix is

$$G = -\left(\frac{\partial^2 \ln q(\eta | \psi)}{\partial \psi\, \partial \psi^T}\right)^{-1}, \tag{11.21}$$

which is the inverse Fisher information of the variational distribution $q(\eta | \psi)$. This setting of $G$ gives
the natural gradient of the lower bound, and therefore not only simplifies the algorithm, but gives an
efficient step direction (Amari, 1998; Sato, 2001). We note that this is the setting of $G$ given in the
stochastic variational algorithm of Sato (2001) and was used in Hoffman et al. (2010b) and Wang
et al. (2011) for online LDA and HDP, respectively.

In the case where the prior-likelihood pair does not fall within the conjugate exponential family, 
stochastic variational inference still proceeds as described, instead using an appropriate G for the 
gradient step in Equation (11.20). The disadvantage of this regime is that the method truly is a 
gradient method, with the attendant step size issues. Using the Fisher information gives a clean and 
interpretable update. 

This interpretability is seen by returning to the example in Section 11.4.1, where the stochastic
variational parameter updates are

$$\chi'_t = (1 - \rho_t)\, \chi'_{t-1} + \rho_t \Big\{ \chi + \frac{D}{|C_t|} \sum_{d \in C_t} t(x_d) \Big\}, \qquad \nu'_t = \nu + D. \tag{11.22}$$


We see that, for conjugate exponential family distributions, each step of stochastic variational in- 
ference entails a weighted averaging of sufficient statistics from previous data with the sufficient 
statistics of new data scaled up to the size of the full dataset. We show a visual representation of 
stochastic variational inference for Bayesian matrix factorization in Figure 11.2.
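The weighted-average form of Equation (11.22) is easy to simulate. The sketch below assumes NumPy and hypothetical Poisson data with $t(x) = x$; the batch size, the number of iterations, and the step size exponent of 0.7 are illustrative choices, not settings from the experiments.

```python
import numpy as np

def stochastic_conjugate_update(chi_prev, chi_prior, batch_stats, D, rho):
    """One step of (11.22): blend the old chi' with the prior plus rescaled batch statistics."""
    batch_stats = np.asarray(batch_stats, dtype=float)
    scaled = (D / batch_stats.shape[0]) * batch_stats.sum(axis=0)
    return (1.0 - rho) * chi_prev + rho * (chi_prior + scaled)

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=10_000)   # hypothetical data with t(x) = x
D, chi_prior, nu_prior = len(x), 1.0, 1.0
batch_chi = chi_prior + x.sum()         # batch answer from Section 11.4.1
nu_post = nu_prior + D                  # nu' stays fixed at nu + D

chi = chi_prior                         # arbitrary initialization
for t in range(1, 2001):
    rho = (1.0 + t) ** -0.7             # sum rho_t diverges, sum rho_t^2 converges
    batch = rng.choice(x, size=100, replace=False)
    chi = stochastic_conjugate_update(chi, chi_prior, batch, D, rho)

relative_gap = abs(chi - batch_chi) / batch_chi
```

After enough iterations the stochastic estimate sits close to the batch value, while each step touches only 100 of the 10,000 observations.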
FIGURE 11.2

A graphic describing stochastic variational inference for Bayesian nonnegative matrix factorization. 
From the larger dataset, first select a subset of data (columns) uniformly at random, indexed by $C_t$
at iteration $t$; for clarity we represent this subset as a contiguous block. Next, fully optimize the
local variational parameters for each document. Because the subset is much smaller than the entire 
dataset, this step is fast. Finally, update the global topic variational parameters using a combination 
of information from the local updates and the previously seen documents, as summarized in the 
current values of these global variational parameters. 


11.5 Variational Inference Algorithms 

We present stochastic variational inference algorithms for Bayesian NMF and correlated NMF.
Table 11.2 contains a list of the variational $q$ distributions we use for each model. The variational
objective functions for both models are given below. Since an expectation with respect to a delta
function is simply an evaluation at the point mass, we write this evaluation for correlated NMF:

B-NMF: $\mathcal{L} = \mathbb{E}_q \ln p(X | B, \Theta) + \mathbb{E}_q \ln p(B) + \mathbb{E}_q \ln p(\Theta) + H[Q]$

C-NMF: $\mathcal{L} = \mathbb{E}_q \ln p(X | B, L, U, \alpha) + \mathbb{E}_q \ln p(B) + \ln p(L) + \ln p(U) + H[Q]$.


As is evident, mean field variational inference requires being able to take expectations of the log joint 
likelihood with respect to the predefined q distributions. As frequently occurs with VB inference, 
this is not possible here. We adopt the common solution of introducing a tractable lower bound for 
the problematic function, which we discuss next. 




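Model            Variational q distributions

Bayesian NMF     $q(\beta_{vk}) = \mathrm{Gamma}(\beta_{vk} | g_{vk}, h_{vk})$
                 $q(\theta_{kd}) = \mathrm{Gamma}(\theta_{kd} | a_{kd}, b_{kd})$
Correlated NMF   $q(\beta_{vk}) = \mathrm{Gamma}(\beta_{vk} | g_{vk}, h_{vk})$
                 $q(\ell_k) = \delta_{\ell_k}$, $q(u_d) = \delta_{u_d}$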


TABLE 11.2 

The variational q distributions for Bayesian NMF and correlated NMF. 


A Lower Bound of the Variational Objective Function 

For both Bayesian NMF and correlated NMF, the variational lower bound contains an intractable
expectation in the log of the Poisson likelihood. To speak in general terms about the problem, let $\omega_{kd}$
represent the document weights. This corresponds to $\theta_{kd}$ in Bayesian NMF and to $\exp\{\alpha_d + \ell_k^T u_d\}$
in correlated NMF.

The problematic expectation is $\mathbb{E}_q \ln \sum_{k=1}^{K} \beta_{vk} \omega_{kd}$. Given the concavity of the natural logarithm,
we introduce a probability vector $p^{(vd)} \in \Delta_K$ for each $(v, d)$ pair in order to lower-bound
this function,

$$\ln\Big(\sum_{k=1}^{K} \beta_{vk} \omega_{kd}\Big) \ge \sum_{k=1}^{K} p_k^{(vd)} \ln(\beta_{vk} \omega_{kd}) - \sum_{k=1}^{K} p_k^{(vd)} \ln p_k^{(vd)}. \tag{11.23}$$

All expectations of this new function are tractable, and the vector $p^{(vd)}$ is an auxiliary parameter that
we optimize with the rest of the model. After each iteration, we optimize this auxiliary probability
vector to give the tightest lower bound. This optimal value is

$$p_k^{(vd)} \propto \exp\{\mathbb{E}_q[\ln \beta_{vk}] + \mathbb{E}_q[\ln \omega_{kd}]\}. \tag{11.24}$$

Section 11.5.1 contains the functional forms of these expectations.
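The bound in Equation (11.23) can be checked numerically for fixed values of the factors. The sketch below assumes NumPy and hypothetical values for one $(v, d)$ pair; with no randomness in $\beta_{vk}$ and $\omega_{kd}$, the optimal vector of Equation (11.24) reduces to normalizing the products $\beta_{vk}\omega_{kd}$.

```python
import numpy as np

def jensen_lower_bound(beta_v, omega_d, p):
    """Right-hand side of (11.23) for one (v, d) pair and a probability vector p."""
    terms = beta_v * omega_d
    return np.sum(p * np.log(terms)) - np.sum(p * np.log(p))

beta_v = np.array([0.2, 1.5, 0.7])    # hypothetical beta_vk for k = 1, 2, 3
omega_d = np.array([2.0, 0.5, 1.0])   # hypothetical omega_kd
exact = np.log(np.sum(beta_v * omega_d))

uniform_p = np.full(3, 1.0 / 3.0)
tight_p = beta_v * omega_d / np.sum(beta_v * omega_d)  # (11.24) for fixed values

bound_uniform = jensen_lower_bound(beta_v, omega_d, uniform_p)
bound_tight = jensen_lower_bound(beta_v, omega_d, tight_p)
```

Any probability vector gives a valid lower bound, and the normalized products make the bound exact in this deterministic setting.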

11.5.1 Batch Algorithms 

Given the relationship between batch and stochastic variational inference, we first present the batch 
algorithm for Bayesian NMF and correlated NMF, followed by the alterations needed to derive their 
stochastic algorithms. For each iteration of inference, batch variational inference cycles through the 
following updates to the parameters of each variational distribution. 

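Parameter Update for $q(\beta_{vk})$

The two gamma distribution parameters for this $q$ distribution (Table 11.2) have the following updates,

$$g_{vk} = \frac{c_0}{V} + \sum_{d=1}^{D} X_{vd}\, p_k^{(vd)}, \tag{11.25}$$

$$h_{vk} = c_0 + \sum_{d=1}^{D} \mathbb{E}_q[\theta_{kd}] \quad \text{(Bayesian NMF)}, \tag{11.26}$$

$$h_{vk} = c_0 + \sum_{d=1}^{D} \exp\{\alpha_d + \ell_k^T u_d\} \quad \text{(Correlated NMF)}. \tag{11.27}$$

Expectations used in other parameter updates are $\mathbb{E}_q[\beta_{vk}] = g_{vk}/h_{vk}$ and $\mathbb{E}_q[\ln \beta_{vk}] = \psi(g_{vk}) -
\ln h_{vk}$, where $\psi(\cdot)$ here denotes the digamma function.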
Parameter Update for $q(\theta_{kd})$ (Bayesian NMF)

The two gamma distribution parameters for this $q$ distribution (Table 11.2) have the following updates,

$$a_{kd} = a_0 + \sum_{v=1}^{V} X_{vd}\, p_k^{(vd)}, \tag{11.28}$$

$$b_{kd} = b_0 + \sum_{v=1}^{V} \mathbb{E}_q[\beta_{vk}]. \tag{11.29}$$

Expectations used in other parameter updates are $\mathbb{E}_q[\theta_{kd}] = a_{kd}/b_{kd}$ and $\mathbb{E}_q[\ln \theta_{kd}] = \psi(a_{kd}) -
\ln b_{kd}$.
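The batch updates for Bayesian NMF can be assembled into a short loop. The following Python sketch is a minimal illustration, assuming NumPy and SciPy; the function name, the random initialization, and the choice to recompute the auxiliary vectors once per iteration are assumptions of the sketch rather than details taken from the text. It stores all $p^{(vd)}$ in a dense $V \times K \times D$ array, so it is only suitable for toy problems.

```python
import numpy as np
from scipy.special import digamma

def batch_vb_bayesian_nmf(X, K, a0, b0, c0, n_iters=20, seed=0):
    """Batch mean-field VB for Bayesian NMF using updates (11.24)-(11.26), (11.28)-(11.29)."""
    V, D = X.shape
    rng = np.random.default_rng(seed)
    # Arbitrary positive initialization (an assumption of this sketch).
    g, h = rng.gamma(100.0, 0.01, size=(V, K)), np.ones((V, K))
    a, b = rng.gamma(100.0, 0.01, size=(K, D)), np.ones((K, D))
    for _ in range(n_iters):
        # (11.24): p[v, k, d] proportional to exp{E[ln beta_vk] + E[ln theta_kd]}
        E_log_beta = digamma(g) - np.log(h)
        E_log_theta = digamma(a) - np.log(b)
        log_p = E_log_beta[:, :, None] + E_log_theta[None, :, :]
        log_p -= log_p.max(axis=1, keepdims=True)      # stabilize before exponentiating
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)              # normalize over topics k
        Xp = X[:, None, :] * p                         # X_vd * p_k^(vd), shape V x K x D
        # Local updates (11.28)-(11.29)
        a = a0 + Xp.sum(axis=0)
        b = b0 + np.repeat((g / h).sum(axis=0)[:, None], D, axis=1)
        # Global updates (11.25)-(11.26)
        g = c0 / V + Xp.sum(axis=2)
        h = c0 + np.repeat((a / b).sum(axis=1)[None, :], V, axis=0)
    return g, h, a, b

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(30, 12))                    # hypothetical toy counts
g, h, a, b = batch_vb_bayesian_nmf(X, K=3, a0=1 / 3, b0=1 / 3, c0=0.05 * 30)
```

Because each $p^{(vd)}$ sums to one, the shape parameters `g` absorb exactly the observed counts on top of the prior, which gives a simple consistency check on an implementation.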


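Parameter Updates for $q(\ell_k)$ and $q(u_d)$ (Correlated NMF)

Since these parameters do not have closed-form updates, we use the steepest ascent gradient method
for inference. The gradients of $\mathcal{L}$ with respect to $\ell_k$ and $u_d$ are

$$\nabla_{\ell_k} \mathcal{L} = \sum_{v=1}^{V} \sum_{d=1}^{D} \Big( X_{vd}\, p_k^{(vd)} - \mathbb{E}_q[\beta_{vk}] \exp\{\alpha_d + \ell_k^T u_d\} \Big) u_d - \sigma^{-2} \ell_k, \tag{11.30}$$

$$\nabla_{u_d} \mathcal{L} = \sum_{v=1}^{V} \sum_{k=1}^{K} \Big( X_{vd}\, p_k^{(vd)} - \mathbb{E}_q[\beta_{vk}] \exp\{\alpha_d + \ell_k^T u_d\} \Big) \ell_k - u_d. \tag{11.31}$$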

For each variable, we take several gradient steps to approximately optimize its value before moving 
to the next variable. 

Parameter Update for $\alpha_d$ (Correlated NMF)

The point estimate for $\alpha_d$ has the following closed-form solution,

$$\alpha_d = \ln \sum_{v=1}^{V} X_{vd} - \ln \sum_{v=1}^{V} \sum_{k=1}^{K} \mathbb{E}_q[\beta_{vk}] \exp\{\ell_k^T u_d\}. \tag{11.32}$$

We update this parameter after each step of $u_d$.
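The local computations for a single document in correlated NMF can be sketched as follows. This is a minimal illustration assuming NumPy; the function names and the randomly generated inputs are hypothetical. It implements the gradient with respect to $u_d$ in Equation (11.31) and the closed-form bias update in Equation (11.32).

```python
import numpy as np

def grad_u_d(x_d, p_d, E_beta, L, u_d, alpha_d):
    """Gradient (11.31) for one document.

    x_d: (V,) counts; p_d: (V, K) auxiliary vectors; E_beta: (V, K); L: (m, K).
    """
    weights = np.exp(alpha_d + L.T @ u_d)                         # (K,)
    coeff = (x_d[:, None] * p_d - E_beta * weights).sum(axis=0)   # summed over v
    return L @ coeff - u_d

def update_alpha_d(x_d, E_beta, L, u_d):
    """Closed-form point estimate (11.32)."""
    return np.log(x_d.sum()) - np.log(np.sum(E_beta * np.exp(L.T @ u_d)))

rng = np.random.default_rng(0)
V, K, m = 5, 3, 2                                   # toy sizes
x_d = rng.poisson(3.0, size=V) + 1.0                # hypothetical nonzero counts
p_d = rng.dirichlet(np.ones(K), size=V)             # each row sums to one
E_beta = rng.gamma(1.0, 1.0, size=(V, K))
L = rng.normal(size=(m, K))
u_d = rng.normal(size=m)
alpha_d = update_alpha_d(x_d, E_beta, L, u_d)
gradient = grad_u_d(x_d, p_d, E_beta, L, u_d, alpha_d)
```

At the closed-form value of $\alpha_d$, the total expected count for the document equals its observed length, which is the sense in which the bias term accounts for document length.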

11.5.2 Stochastic Algorithms 

By inserting the lower bound (11.23) into the log joint likelihood and then exponentiating, one can
see that the likelihood of $\beta_{vk}$ is modified to form a conjugate exponential pair with its prior for both
models. Hence, the discussion and theory of natural gradient ascent in Section 11.4.2 applies to both
models with respect to the topic matrix $B$. For correlated NMF, this does not apply to the global
variable $\ell_k$. For this variable, we use the alternate gradient method discussed in Section 11.4.2.

After selecting a subset of the data using the index set $C_t \subset \{1, \dots, D\}$, stochastic inference
starts by optimizing the local variables, which entails iterating between the parameter updates for
$\theta_d$ and $p^{(vd)}$ for Bayesian NMF, and $u_d$ and $p^{(vd)}$ for correlated NMF. Once these parameters have
converged, we take a single step in the direction of the natural gradient to update the distributions on
$\beta_{vk}$ and use Newton’s method in the step for $\ell_k$. We use a step size of the form $\rho_t = (t_0 + t)^{-\kappa}$ for
$t_0 > 0$ and $\kappa \in (.5, 1]$. This step size satisfies the necessary conditions for convergence discussed in
Section 11.4.2 (Bottou, 1998). We also recall from Section 11.4.2 that $D$ is the corpus size to which
each batch $C_t$ is scaled up.

Stochastic Update of $q(\beta_{vk})$

As with batch inference, this update is similar for Bayesian NMF and correlated NMF. In keeping
with the generalization at the beginning of this section, we let $\omega_{kd}$ stand for $\theta_{kd}$ or $\exp\{\alpha_d + \ell_k^T u_d\}$,
depending on the model under consideration. The update of the variational parameters of $q(\beta_{vk})$ is

$$g_{vk}^{(t)} = (1 - \rho_t)\, g_{vk}^{(t-1)} + \rho_t \Big\{ \frac{c_0}{V} + \frac{D}{|C_t|} \sum_{d \in C_t} X_{vd}\, p_k^{(vd)} \Big\}, \tag{11.33}$$

$$h_{vk}^{(t)} = (1 - \rho_t)\, h_{vk}^{(t-1)} + \rho_t \Big\{ c_0 + \frac{D}{|C_t|} \sum_{d \in C_t} \mathbb{E}_q[\omega_{kd}] \Big\}. \tag{11.34}$$

As expected from the theory, the variational parameters are a weighted average of their previous
values and the sufficient statistics calculated using batch $C_t$.
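A direct transcription of Equations (11.33) and (11.34), together with the step size schedule, might look as follows. The sketch assumes NumPy; the function names, argument layout, and toy inputs are hypothetical, and the local quantities are taken as already optimized for the batch.

```python
import numpy as np

def step_size(t, t0=1.0, kappa=0.75):
    """rho_t = (t0 + t)^(-kappa), with t0 > 0 and kappa in (0.5, 1]."""
    return (t0 + t) ** (-kappa)

def stochastic_beta_update(g_prev, h_prev, Xp_batch, E_omega_batch, D, c0, rho):
    """Updates (11.33)-(11.34).

    Xp_batch: (V, K, |C_t|) values X_vd * p_k^(vd) for the documents in the batch.
    E_omega_batch: (K, |C_t|) values E_q[omega_kd] for the same documents.
    """
    V, K, batch_size = Xp_batch.shape
    scale = D / batch_size                                  # scale batch up to corpus size
    g_hat = c0 / V + scale * Xp_batch.sum(axis=2)
    h_hat = c0 + scale * E_omega_batch.sum(axis=1)[None, :]  # identical for every word v
    g_new = (1.0 - rho) * g_prev + rho * g_hat
    h_new = (1.0 - rho) * h_prev + rho * np.broadcast_to(h_hat, (V, K))
    return g_new, h_new

rng = np.random.default_rng(0)
V, K, D = 6, 2, 4                                           # toy sizes
Xp_all = rng.uniform(size=(V, K, D))                        # hypothetical local statistics
E_omega_all = rng.uniform(size=(K, D))
g0, h0 = np.ones((V, K)), np.ones((V, K))
g1, h1 = stochastic_beta_update(g0, h0, Xp_all[:, :, :2], E_omega_all[:, :2],
                                D=D, c0=0.3, rho=step_size(1))
```

Setting the step size to one and using the whole corpus as the batch recovers the batch updates (11.25) and (11.26), which is a convenient check.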


Stochastic Update of $q(\ell_k)$ (Correlated NMF)

Stochastic inference for correlated NMF has an additional global variable in the location of each
topic. The posterior of this variable (a point estimate) is not conjugate with the prior, and therefore
we do not use the natural gradient stochastic VB approach discussed in Section 11.4.2. However, as
pointed out in that section, we can still perform stochastic inference according to the general update
given in Equation (11.20). With reference to this equation, we set the preconditioning matrix $G$ to
be the inverse negative Hessian and update $\ell_k$ at iteration $t$ as follows,

$$\ell_k^{(t)} = \ell_k^{(t-1)} + \rho_t G \nabla_{\ell_k} \mathcal{L}_s(X_{C_t}, U_{C_t}, \alpha_{C_t}, L, B), \tag{11.35}$$

$$G = \Big( \sigma^{-2} I_m + \frac{D}{|C_t|} \sum_{d \in C_t} \sum_{v=1}^{V} \mathbb{E}_q[\beta_{vk}]\, e^{\alpha_d + \ell_k^T u_d}\, u_d u_d^T \Big)^{-1}.$$

In batch inference, we perform gradient (steepest) ascent optimization as well. A key difference
there is that we fully optimize each $\ell_k$ and $u_d$ before moving to the next variable; indeed, for
stochastic VB we still fully optimize the local variable $u_d$ with steepest ascent during each iteration.
For stochastic learning of $\ell_k$, however, we take only one step in the direction of the gradient
before moving on to a new batch of documents. In an attempt to take the
best step possible, we use the Hessian matrix to construct a Newton step.
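One way to code the stochastic Newton step of Equation (11.35) for a single topic location is sketched below. It assumes NumPy; the function name, argument layout, and toy inputs are hypothetical, and the negative Hessian is the one implied by differentiating the gradient in Equation (11.30), with the batch sum rescaled by $D/|C_t|$.

```python
import numpy as np

def stochastic_newton_step_l_k(l_k, U_batch, alpha_batch, Xp_k, E_beta_k, D, sigma2, rho):
    """One stochastic Newton step (11.35) for a single topic location l_k.

    U_batch: (m, B) document locations; alpha_batch: (B,) bias terms;
    Xp_k: (V, B) values X_vd * p_k^(vd); E_beta_k: (V,) values E_q[beta_vk].
    """
    m, B = U_batch.shape
    scale = D / B
    weights = np.exp(alpha_batch + l_k @ U_batch)        # (B,): exp{alpha_d + l_k^T u_d}
    rate = E_beta_k.sum() * weights                      # sum over v of E[beta_vk] * weight
    grad = scale * (U_batch @ (Xp_k.sum(axis=0) - rate)) - l_k / sigma2
    neg_hessian = np.eye(m) / sigma2 + scale * (U_batch * rate) @ U_batch.T
    # Solve a linear system rather than forming the inverse G explicitly.
    return l_k + rho * np.linalg.solve(neg_hessian, grad)

rng = np.random.default_rng(0)
m, B, V, D = 2, 5, 4, 50                                 # toy sizes
U_batch = rng.normal(size=(m, B))
alpha_batch = np.zeros(B)
Xp_k = rng.uniform(size=(V, B))                          # hypothetical local statistics
E_beta_k = rng.uniform(size=V)
l_old = rng.normal(scale=0.5, size=m)
l_new = stochastic_newton_step_l_k(l_old, U_batch, alpha_batch, Xp_k, E_beta_k,
                                   D=D, sigma2=0.5, rho=0.1)
```

Since the negative Hessian is positive definite, the preconditioned direction is always an ascent direction for the stochastic objective.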


11.6 Experiments 

We perform experiments using stochastic variational inference to learn the variational posteriors of 
Bayesian NMF and correlated NMF. We compare these algorithms with online LDA of Hoffman 
et al. (2010b). We summarize the dataset, parameter settings, experimental setup, and performance 
evaluation method below. 

Dataset. We work with a dataset of 1,819,268 articles from the New York Times newspaper. The ar- 
ticle dates range from January 1987 to May 2007. We use a dictionary of V = 8000 words learned 
from the data, and randomize the order of the articles for processing. 

Parameter Settings. In all experiments, we set the parameter $c_0 = 0.05V$. When learning a $K$-topic
model, for online LDA we set the parameter for the Dirichlet distribution on topic weights to
$\alpha_0 = 1/K$, and for Bayesian NMF we set the weight parameters to $a_0 = 1/K$ and $b_0 = 1/K$.
For correlated NMF we use a latent space dimensionality of $m = 50$ for all experiments, and set
$\sigma^2 = 1/m$.
Experimental setup. We compare stochastic inference for Bayesian NMF and correlated NMF with
online LDA. For all models, we perform experiments for $K \in \{50, 100, 150\}$ topics. We also evaluate
the inference method for several batch sizes using $|C_t| \in \{500, 1000, 1500, 2000\}$. We use a
step size of $\rho_t = (1 + t)^{-0.5}$. Stochastic inference requires initialization of all global variational
parameters. For topic-related parameters, we set the variational parameters to be the prior plus a
Uniform(0, 1) random variable that is scaled to the size of the corpus, similar to the scaling performed
on the statistics of each batch. For correlated NMF, we sample $\ell_k \sim \mathrm{Normal}(0, \sigma^2 I_m)$.

Performance Evaluation. To evaluate performance, we hold out every tenth batch for testing. On 
each testing batch, we perform threefold cross validation by partitioning each document into thirds. 
Using the current values of the global variational parameters, we then train the local variables on 
two-thirds of each document and predict the remaining third. For prediction, we use the mean of 
each variational q distribution. We average the per-word log-likelihoods of all words tested to quan- 
tify the performance of the model at the current step of inference. After testing the batch, stochastic 
inference proceeds as before, with the testing batch processed first — this doesn’t compromise the 
algorithm since we make no updates to the global parameters during testing, and since every testing 
batch represents a new sample from the corpus. 

Experimental Results. Figure 11.3 contains the log-likelihood results for the threefold cross validation
testing. Each plot corresponds to a setting of $K$ and $|C_t|$. From the plots, we can see how
performance trends with these parameter settings. We first see that performance improves as the 
number of topics increases within our specified range. Also, we see that as the batch size increases, 
performance improves as well, but appears to reach a saturation point. At this point, increasing the 
batch size does not appear to significantly improve the direction of the stochastic gradient, meaning 
that the quality of the learned topics remains consistent over different batch sizes. 

Performance is roughly the same for the three models considered. For online LDA and Bayesian
NMF, this perhaps is not surprising given the similarity between the two models discussed in
Section 11.3.1. Modeling topic correlations with correlated NMF does not appear to improve upon the
performance of online LDA and Bayesian NMF. Nevertheless, correlated NMF does provide some
additional tools for understanding the data.

In Figure 11.4 and Table 11.3, we show results for correlated NMF with $K = 150$ and
$|C_t| = 1000$. In Table 11.3 we show the most probable words from the 40 most probable topics.
In Figure 11.4 we show the correlations learned using the latent locations of the topics. The
correlation between topic $i$ and $j$ is calculated as

$$\mathrm{Corr}(\mathrm{topic}_i, \mathrm{topic}_j) = \frac{\ell_i^T \ell_j}{\sqrt{\ell_i^T \ell_i\; \ell_j^T \ell_j}}. \tag{11.36}$$

The learned correlations are meaningful. For example, two negatively correlated topics, topic 19 
and topic 25, concern the legislative branch and football, respectively. On the other hand, topic 12, 
concerning baseball, correlates positively with topic 25 and negatively with topic 19. The ability to 
interpret topic meanings does not decrease as their probability decreases, as we show in Table 11.4.
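The topic correlations of Equation (11.36) follow directly from the location matrix. The sketch below assumes NumPy and a small hypothetical matrix of locations; the function name is likewise invented for illustration.

```python
import numpy as np

def topic_correlations(L):
    """Correlation matrix (11.36) from topic locations stored as the columns of L (m x K)."""
    gram = L.T @ L                        # entries l_i^T l_j
    norms = np.sqrt(np.diag(gram))        # sqrt(l_i^T l_i)
    return gram / np.outer(norms, norms)

L = np.array([[1.0, -1.0, 2.0],
              [0.0,  0.0, 2.0]])          # hypothetical locations for K = 3 topics, m = 2
corr = topic_correlations(L)
```

In this toy case the first two topics point in opposite directions and are perfectly negatively correlated, while the third lies at 45 degrees to the first.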
FIGURE 11.3

Performance results for Bayesian NMF, correlated NMF, and online LDA on the New York Times 
corpus. Results are shown for various topic settings and batch sizes. In general, performance is 
similar for all three models. Performance tends to improve as the number of topics increases. There 
appears to be a saturation level in batch size, that being the point where increasing the number of
documents does not significantly improve the stochastic gradient. Performance on this dataset does
not appear to improve significantly as $|C_t|$ increases over 1000 documents.
FIGURE 11.4

Correlations learned by correlated NMF for $K = 100$ and $|C_t| = 1000$. The figure contains correlations
for the 40 most probable topics sorted by probability. Table 11.3 contains the most probable
words associated with each topic in this figure. A green block indicates positive correlation, while a
red block indicates negative correlation. The size of the block increases with increasing correlation.
The diagonal corresponds to perfect correlation, and is left in the figure for calibration.
Correlated NMF: Most probable words from the most probable topics

1. think, know, says, going, really, see, things, lot, got, didn 

2. life, story, love, man, novel, self, young, stories, character, characters 

3. wife, beloved, paid, notice, family, late, deaths, father, mother 

4. policy, issue, debate, right, need, support, process, act, important 

5. percent, prices, market, rate, economy, economic, dollar, growth, rose 

6. government, minister, political, leaders, prime, officials, party, talks, foreign, 
economic 

7. companies, billion, percent, corporation, stock, share, largest, shares, business, 
quarter 

8. report, officials, department, agency, committee, commission, investigation, in- 
formation, government 

9. going, future, trying, likely, recent, ago, hopes, months, strategy, come 

10. social, professor, society, culture, ideas, political, study, harvard, self 

11. court, law, judge, legal, justice, case, supreme, lawyers, federal, filed

12. yankees, game, mets, baseball, season, run, games, hit, runs, series 

13. trial, charges, case, prison, jury, prosecutors, federal, attorney, guilty 

14. film, theater, movie, play, broadway, director, production, show, actor

15. best, need, better, course, easy, makes, means, takes, simple, free 

16. party, election, campaign, democratic, voters, candidate, republican 

17. town, place, small, local, visit, days, room, road, tour, trip 

18. military, army, forces, troops, defense, air, soldiers, attacks, general 

19. senate, bill, house, congress, committee, republicans, democrats 

20. went, came, told, found, morning, away, saw, got, left, door 

21. stock, investors, securities, funds, bonds, market, percent, exchange 

22. asked, told, interview, wanted, added, felt, spoke, relationship, thought 

23. restaurant, food, menu, cook, dinner, chicken, sauce, chef, dishes 

24. school, students, education, college, teachers, public, campus 

25. team, season, coach, players, football, giants, teams, league, game, bowl 

26. family, father, mother, wife, son, husband, daughter, friends, life, friend 

27. inc, net, share, reports, qtr, earns, sales, loss, corp, earnings

28. tax, budget, billion, spending, cuts, income, government, percent 

29. art, museum, gallery, artists, show, exhibition, artist, works, paintings 

30. police, officers, man, arrested, gun, shot, yesterday, charged, shooting 

31. executive, chief, advertising, business, agency, marketing, chairman 

32. game, points, knicks, basketball, team, nets, season, games, point, play 

33. bad, far, little, hard, away, end, better, keep, break, worse 

34. study, cancer, research, disease, tests, found, blood, test, cells 

35. business, sold, market, buy, price, sell, selling, bought, sale, customers 

36. public, questions, saying, response, criticism, attack, news, answer 

37. street, park, avenue, west, east, side, village, neighborhood, central 

38. feet, right, head, foot, left, side, body, eye, see, eyes 

39. system, technology, research, program, industry, development, experts 

40. music, band, songs, rock, jazz, song, pop, singer, album, concert 

TABLE 11.3 

Most probable words from the most probable topics for $K = 150$, $|C_t| = 1000$.
Correlated NMF: Most probable words from less probable topics

41. television, films, network, radio, cable, series, show, fox, nbc, cbs 
44. minutes, salt, add, oil, pepper, cup, heat, taste, butter, fresh 
48. computer, internet, technology, software, microsoft, computers, digital, elec- 
tronic, companies, information 

61. john, thomas, smith, scott, michael, james, lewis, howard, kennedy 

83. iraq, iran, iraqi, hussein, war, saudi, gulf, saddam, baghdad, nations

84. mayor, governor, giuliani, council, pataki, bloomberg, cuomo, assembly 
98. rangers, game, goal, devils, hockey, games, islanders, team, goals, season 

111. israel, israeli, palestinian, peace, arab, arafat, casino, bank, west
114. china, chinese, india, korea, immigrants, immigration, asia, beijing
134. british, london, england, royal, prince, sir, queen, princess, palace
141. catholic, roman, irish, pope, ireland, bishop, priest, cardinal, john, paul 

TABLE 11.4 

Some additional topics not given in Table 11.3. Topics with less probability still capture coherent 
themes. Topics not shown were similarly coherent. 


11.7 Conclusion 

We have presented stochastic variational inference algorithms for two Bayesian nonnegative matrix
factorization models: Bayesian NMF (Cemgil, 2009), a Bayesian extension of NMF (Lee and
Seung, 1999); and correlated NMF, a new matrix factorization model that takes its motivation from
the correlated topic model (Blei and Lafferty, 2007). Many other nonnegative matrix factorization
models are candidates for stochastic inference, for example those based on Bayesian nonparametric
priors such as the gamma process (Hoffman et al., 2010a) and the beta process (Paisley et al., 2011).
Topic models are a versatile tool for understanding corpora, but they are not perfect. In this chapter, 
we describe the problems users often encounter when using topic models for the first time. We begin 
with the preprocessing choices users must make when creating a corpus for topic modeling, followed 
by the options users have for running topic models. Once a topic model has been learned from data, 
we describe how users can tell whether it is a good model, summarize the common problems users 
face, and show how those problems can be addressed by recent advances in both models and tools. 


Handbook of Mixed Membership Models and Its Applications 


12.1 Introduction 

Topic models are statistical models for learning the latent structure in document collections, and 
have gained much attention in the machine learning community over the last decade. Topic models 
improve the ways users find and discover text content in digital libraries, search interfaces, and 
across the web, through their ability to automatically learn and apply subject tags to documents in a 
collection. However, this potential requires practitioners to overcome the problems often associated 
with topic models: when to use them, how to know when there are problems, how to fix those 
problems, and how to make topic models more useful. 

Topic modeling is an increasingly popular framework for simultaneously soft clustering terms 
and documents into a fixed number of topics, each of which takes the form of a multinomial 
distribution over terms in the document collection. Topic models are useful for a variety of research 
tasks and user-facing applications described below. We start by introducing notation for the original 
generative topic model, latent Dirichlet allocation (LDA) (Blei et al., 2003). 

Latent Dirichlet allocation and its extensions form one popular class of topic models and will 
be the basis of discussion for this chapter. The LDA topic model is based on the assumption that 
documents have multiple topics. 

In LDA topic modeling, each of $D$ documents in the corpus is modeled as a discrete distribution 
over $T$ latent topics, and each topic is a discrete distribution over the vocabulary of $W$ words. In the 
LDA topic model, the number of topics $T$ is fixed and specified by the modeler. For document $d$, the 
distribution over topics, $\theta_{t|d}$, is drawn from a Dirichlet distribution $\mathrm{Dir}[\alpha]$, where $\alpha$ might either be 
a symmetric constant vector (say $\alpha_0 \mathbf{1}$) or a hyperparameter with variable values (say $(\alpha_1, \ldots, \alpha_T)$) 
which can be estimated. Likewise, each distribution over words, $\phi_{w|t}$, is drawn from a Dirichlet 
distribution $\mathrm{Dir}[\beta]$. 

For the $i$th token in a document, a topic assignment $z_{id}$ is drawn from $\theta_{t|d}$ and the word, $x_{id}$, is 
drawn from the corresponding topic, $\phi_{w|z_{id}}$. Hence, the generative process in LDA is given by 

$$\theta_{t|d} \sim \mathrm{Dir}[\alpha], \qquad \phi_{w|t} \sim \mathrm{Dir}[\beta], \tag{12.1}$$ 

$$z_{id} \sim \mathrm{Mult}[\theta_{t|d}], \qquad x_{id} \sim \mathrm{Mult}[\phi_{w|z_{id}}]. \tag{12.2}$$ 
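
The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `generate_corpus`, the toy sizes, the fixed document length, and the symmetric values `alpha0` and `beta0` are all hypothetical choices, not part of the model definition.

```python
import numpy as np

def generate_corpus(D=5, T=3, W=20, doc_len=50, alpha0=0.5, beta0=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # phi[t] is the word distribution of topic t, drawn from Dir[beta].
    phi = rng.dirichlet(np.full(W, beta0), size=T)
    # theta[d] is the topic distribution of document d, drawn from Dir[alpha].
    theta = rng.dirichlet(np.full(T, alpha0), size=D)
    docs = []
    for d in range(D):
        # z[i] ~ Mult[theta_d]: one topic assignment per token.
        z = rng.choice(T, size=doc_len, p=theta[d])
        # x[i] ~ Mult[phi_{z[i]}]: one word id per token, given its topic.
        x = np.array([rng.choice(W, p=phi[t]) for t in z])
        docs.append((z, x))
    return theta, phi, docs

theta, phi, docs = generate_corpus()
```

Inference runs this process in reverse: only the word ids `x` are observed, and the assignments `z` and the distributions `theta` and `phi` must be recovered.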

We can compute the posterior distribution of the topic assignments via Gibbs sampling or 
variational inference. Given samples from the posterior distribution we can compute point estimates 
of the document-topic proportions $\theta_{t|d}$ and the word-topic probabilities $\phi_{w|t}$. We will henceforth 
denote $\phi_t$ as the vector of word probabilities for a given topic $t$. 

The original LDA topic model has been extended in dozens of ways. Most of the extensions are 
a result of addressing a potential limitation of LDA, or taking advantage of an opportunity made 
available by additional data. Some notable extensions include: the correlated topic model (Blei and 
Lafferty, 2005); the nonparametric topic model, or hierarchical Dirichlet process model (Teh et al., 
2006); the hierarchical topic model (Blei et al., 2007); and the dynamic topic model (Blei and 
Lafferty, 2006). To a large extent, these particular extensions have not directly addressed some of 
the usability issues we focus on in this chapter. 

Nevertheless, there has been a thriving cottage industry adding more and more information to 
topic models to correct some of the shortcomings we are interested in, by modeling perspective 
(Paul and Girju, 2010; Lin et al., 2006), syntax (Wallach, 2006; Gruber et al., 2007), or 
authorship (Rosen-Zvi et al., 2004; Dietz et al., 2007). Similarly, there has been an effort to inject 
semantic knowledge into topic models (Boyd-Graber et al., 2007). 
12.1.1 Using Topic Models 

In the academic literature, topic modeling has been demonstrated to be highly effective in a wide 
range of research-oriented tasks, including multi-document summarization (Haghighi and Vander- 
wende, 2009), word sense discrimination (Brody and Lapata, 2009), sentiment analysis (Titov and 
McDonald, 2008), machine translation (Eidelman et al., 2012), information retrieval (Wei and Croft, 
2006), discourse analysis (Purver et al., 2006; Nguyen et al., 2012), and image labeling (Fei-Fei and 
Perona, 2005). In these tasks the topics are used as features in some larger algorithm, and not as 
first-order outputs of interest. 

Beyond these research-type tasks, topic modeling has been demonstrated in several user-facing 
applications. Here, the topics themselves are of direct interest. Applications range from search and 
discovery interfaces to other types of collection analysis interfaces. There are several noteworthy 
examples, including two from the U.S. funding agencies NIH and NSF. The NIH Map Viewer 1 is 
both a topic-based search interface and a map visualizing the research funded by NIH (Talley et al., 
2011). The STAR METRICS Portfolio Explorer 2 features topics describing NSF-funded research. 
Another example is the topic model browser for the journal Science. 3 

The remainder of this chapter is organized as follows. In this section, we further introduce topic 
modeling: how one goes from raw data to a topic model. In Section 12.2, we talk about problems 
and issues with topic modeling. In Section 12.3, we discuss diagnostics that are useful for detecting 
and measuring these problems. Finally, in Section 12.4 we review new methods aimed at improving 
the performance and utility of topic models in addition to those aimed at addressing some of their 
problems. 

12.1.2 Preprocessing Text Data 

Topic models take documents that contain words as input. This seems simple enough, but often 
the process of going from a source document to a form that can be understood by topic models 
drastically changes the final output. Suppose, for example, that we wanted to build a topic model 
using Wikipedia as our data source. How would we turn that into a sequence of words that could be 
used as input to a topic model? 

Readers experienced with data processing and natural language processing can safely skip to 
Section 12.1.3, where we assume that we have the necessary input data for topic modeling. 

First, let’s take a look at what an individual Wikipedia page looks like: 4 


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html lang="en" dir="ltr" class="client-nojs" 
xmlns="http://www.w3.org/1999/xhtml"> <head> <title>Princess 
Ida - Wikipedia, the free encyclopedia</title> <meta 
http-equiv="Content-Type" content="text/html; charset=UTF-8" 
/> <meta http-equiv="Content-Style-Type" content="text/css" 
/> <meta name="generator" content="MediaWiki 1.18wmf1" /> 


1 See https://app.nihmaps.org. 

2 See http://readidata.nitrd.gov/star/. 

3 See http://topics.cs.princeton.edu/Science/. 

4 For this example, we use the HTML representation of a Wikipedia article. This is because it’s easy to inspect on the web, 
isn't restricted by copyrights, and has many of the problems that web corpora have. For real applications, you should not use 
HTML served by Wikipedia’s web servers but instead download their XML dumps available at http://dumps.wikimedia.org. 
This will make your life easier (it lacks many of the problems that we address in this section) and will save both you and 
Wikipedia bandwidth. 
Little in this raw format is what we would call a word, and being able to effectively use this as an 
input to topic models would require us to do substantial preprocessing. Once we remove extraneous 
material, we still have to determine what “words” we’re going to use and how to extract them from 
the remaining text. We go through each of these steps to produce a document in a form that is usable 
for topic modeling. 

Often the files that make up our corpus contain extraneous information that does not add to 
the content of the data. With the Princess Ida example, HTML markup obscures what the underlying 
words are. We can remove the markup using a regular expression or a variety of text processing tools 
(e.g., using the Natural Language Toolkit (Bird et al., 2009)). 
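
As one possible sketch of this step, the markup can also be stripped with the HTML parser in the Python standard library, swapped in here for the regular expression or NLTK route mentioned above. The class name `TextExtractor` and the helper `html_to_text` are hypothetical names, and skipping only `<script>` and `<style>` content is an assumption that real pages may not satisfy.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

sample = ("<html><head><title>Princess Ida</title><style>p{}</style></head>"
          "<body><p>A comic <b>opera</b>.</p></body></html>")
text = html_to_text(sample)
```

Joining text nodes with spaces keeps words from adjacent elements apart, at the cost of detaching some punctuation, which is harmless for a bag-of-words model.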


Princess Ida - Wikipedia, the free encyclopedia Princess Ida From 
Wikipedia, the free encyclopedia Jump to: navigation , search 
Princess Ida; or, Castle Adamant is a comic opera with music by 
Arthur Sullivan and libretto by W. S. Gilbert. It was their eighth 
operatic collaboration of fourteen. 


Personal tools Log in / create account Namespaces Article Discussion 
Variants Views Read Edit View history Actions Search Navigation Main 
page Contents Featured content Current events Random article 
Donate to Wikipedia Interaction Help About Wikipedia Community por- 
tal Recent changes Contact Wikipedia Toolbox What links here Related 
changes Upload file Special pages Permanent link Cite this page 
Print/export Create a book Download as PDF Printable version 
Languages Fran\xc3\xa7ais Italiano This page was last modified on 23 
September 2011 at 23:59. Text is available under the Creative 
Commons Attribution-ShareAlike License; additional terms may apply. 
See Terms of use for details. Wikipedia&reg; is a registered trade- 
mark of the Wikimedia Foundation, Inc., a non-profit organization. 
Contact us Privacy policy About Wikipedia Disclaimers Mobile view 


Now that we’ve removed some of the HTML that obscured the content, we can see content 
that is often referred to as boilerplate: text that is repeated verbatim across many documents. Many 
forms of boilerplate text (Freedman, 2007) appear on this Wikipedia page. Some of it fulfills a 
legal function (“Text is available under the Creative Commons”), some a navigation function 
(“Search Navigation”), and some provides metadata (“last modified on”). 

While these data are useful and necessary for an HTML page, they do not tell us about the 
content of the document, which is the goal of topic modeling. Failing to remove this boilerplate 
material can result in the discovery of topics that include just this boilerplate text. Because such text 
is on many pages, this is often a suboptimal result. 

Typically, boilerplate can be removed by heuristics (e.g., removal of the first or last N bytes) 
or, failing that, by methods that can discover boilerplate (Kohlschütter et al., 2010). Such text can 
take many forms: signatures from prolific posters in a newsgroup, legalese in advertisements, contact 
information in press releases, or quotes appearing at the start of book chapters. 

Removing such boilerplate gives us: 
Princess Ida; or, Castle Adamant is a comic opera with music by Arthur 
Sullivan and libretto by W. S. Gilbert. It was their eighth operatic 
collaboration of fourteen. Princess Ida opened at the Savoy Theatre on 
January 5, 1884, for a run of 246 performances. The piece concerns a 
princess who founds a women's university and teaches that women are 
superior to men and should rule in their stead. The prince to whom she 
had been married in infancy sneaks into the university, together with 
two friends, with the aim of collecting his bride. They disguise 
themselves as women students but are discovered, and all soon face a 
literal war between the sexes. 


which is finally getting us the content we want. Now we can begin extracting words from the text. 
Recall that most topic models treat documents as a bag-of-words, so we can stop caring about the 
order of the tokens within the text and concentrate on how many times a particular word appears in 
the text. 

With this in mind, below we show the sixty most frequent “words” sorted by frequency if we 
consider words to be anything delimited by whitespace. 
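
A count of this kind takes only a few lines. The sketch below uses a short made-up string rather than the article text, and the helper name `whitespace_counts` is hypothetical; it shows how punctuation and capitalization produce separate entries when whitespace is the only delimiter.

```python
from collections import Counter

def whitespace_counts(text):
    """Count strings delimited by whitespace, with no other processing."""
    return Counter(text.split())

# A tiny stand-in for the article text.
counts = whitespace_counts("The opera opened . The Opera closed .")
top = counts.most_common(2)
```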
Many of these strings are not what we would consider to be words but are instead punctuation. 
In most applications of topic modeling, we do not care about the punctuation used, so we likely want 
to remove it. Many of these words are also not content words; words like “the,” “and,” and “of” 
are function words that don’t provide any information about what the article is about. Such terms 
are typically called stopwords. 

In addition to including items that do not help us understand what the document is about, 
we are also making distinctions between words that under most reasonable interpretations should be 
viewed as identical. For example, the strings “Hilarion” and “Hilarion,” (with trailing punctuation 
attached) are considered to be distinct. Similarly, “opera” and “Opera” are considered to be distinct. 
This suggests that we need to be more aggressive when separating words. 

On the other hand, there are also clues that we need to be less aggressive in separating words. 
For example, there are multi-word expressions that we might want to treat as pseudowords: 
“gilbert and sullivan” might be a reasonable multiword expression to treat as a fixed unit, as would 
“princess ida” and “king gama.” 5 

How do we address these issues? These problems are typically viewed as problems of stopword 
removal, normalization, tokenization, and collocation discovery. We discuss each of them in turn. 

Stopword Removal 

The most common way to remove words that do not contribute to the meaning of a document is to 
use a fixed list. Such lists are available in many languages and typically take care of most stopwords. 
However, such lists are not complete, and there are often corpus-specific stopwords that such lists 

5 There has been considerable interest in discovering multiword expressions either after topic 
modeling (Blei and Lafferty, 2009) or as part of the process for discovering topics (Johnson, 2010; 
Hardisty et al., 2010). However, we view it as a preprocessing step (which is much more efficient). 
would never discover. For example, in the Wikipedia corpus, “edit” or “citation” might appear so 
often in the HTML pages of Wikipedia that they do not serve to differentiate documents. 6 

Rather than having a set list of stopwords, other approaches take an adaptive threshold for which 
words are stopwords. For example, one could compute the tf-idf (Salton, 1968) of each term in a 
document and only consider terms that are above some reasonably set threshold. 
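
A minimal sketch of such an adaptive filter follows. The function name `tfidf_filter`, the threshold value, and the particular tf-idf variant (raw count times log(N/df)) are assumptions made for illustration; other weightings are common.

```python
import math
from collections import Counter

def tfidf_filter(docs, threshold):
    """Keep, per document, the terms whose tf-idf exceeds `threshold`.

    docs is a list of token lists. tf is the raw count in the document and
    idf is log(N / df); both are illustrative choices.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    kept = []
    for doc in docs:
        tf = Counter(doc)
        kept.append({t for t, c in tf.items()
                     if c * math.log(n_docs / df[t]) > threshold})
    return kept

docs = [["the", "opera", "the", "princess"],
        ["the", "edit", "trout", "trout"],
        ["the", "edit", "salmon"]]
kept = tfidf_filter(docs, threshold=0.5)
```

In this toy corpus “the” appears in every document, so its idf is zero and it is dropped everywhere; “edit” appears in two of three documents and also falls below the threshold, behaving like a corpus-specific stopword.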

Normalization 

Here, we use normalization in a very broad sense. For a particular concept, there may be many 
different character strings that can represent it in a language. For instance, “Dog,” “Dogs,” “dog,” 
and “dogs” all refer to the same underlying concept, except that some are plural, and some are 
capitalized. For the purposes of topic modeling, we may wish to assume that these are actually the 
same word. Converting to lower case and applying a stemming algorithm (Porter, 1980) can convert 
all of these to a canonical form, “dog.” 

For languages with a richer morphology (Taghva et al., 2005), this is particularly critical. Failing 
to normalize can lead to an overly large vocabulary (which slows inference) and can lead to poorer 
topics, as identical words in slightly different syntactic contexts are treated as distinct. However, for 
English, this is more a matter of taste. When topics are designed for human inspection, many users 
prefer not to see stemmed words. 

Tokenization 

Tokenization (or segmentation) is the process of breaking a string of text into its constituent words. 
For English, whitespace is a good proxy for detecting word boundaries. However, it is not perfect 
(as we saw above), and there may be other conventions for breaking a string of text into 
constituent words. For example, Treebank tokenization (Marcus et al., 1993) separates “won’t” into 
“wo” and “n’t.” Other languages with implicit word boundaries may require more involved 
preprocessing (Goldwater et al., 2006). 
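
The stopword removal, lowercasing, and tokenization steps discussed so far can be combined in a short sketch. The pattern `[a-z0-9]+`, the tiny stop list, and the helper name `tokenize` are assumptions chosen for illustration; in particular the pattern discards accented letters, so it is only reasonable for plain English text.

```python
import re

# Split on all punctuation and whitespace by keeping only runs of
# lowercase ASCII letters and digits (an English-only assumption).
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on punctuation and whitespace, drop stopwords."""
    return [tok for tok in TOKEN_RE.findall(text.lower())
            if tok not in stopwords]

STOP = frozenset({"the", "and", "of", "a", "is", "by", "with"})
tokens = tokenize("Princess Ida is a comic opera; the Opera by D'Oyly Carte.", STOP)
```

Here `tokens` is `['princess', 'ida', 'comic', 'opera', 'opera', 'd', 'oyly', 'carte']`: the two spellings of “opera” are merged, but the apostrophe splits “d'oyly” into two pieces, which is the kind of trade-off aggressive tokenization brings.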

Collocation Discovery 

Often, a word’s meaning is constrained by its local context (Schemann and Knight, 1995). For 
example, “house” means one thing on its own, but as part of “white house” it means quite 
another. Discovering multi-word expressions is a common task in natural language processing 
(Manning and Schütze, 1999). Often, topic modeling is done while ignoring multiword expressions. 

This can lead to suboptimal outcomes for a number of reasons. First, it can lead to topics that 
join together unrelated concepts. For example, by treating “soviet” and “union” as separate tokens, 
a topic model might group together documents on the soviet union and the civil war (Chang et al., 
2009a). Even when topic models don’t make such errors, it can annoy savvy users who see obvious 
multi-word expressions separated or displayed in the wrong order (e.g., displaying a topic as “bush,” 
“clinton,” “house,” and “white”). 

Let us now return to our Wikipedia article on Princess Ida, where we identified bigrams scored 
by point-wise mutual information (PMI), removed stopwords, and tokenized based on all punctua- 
tion and whitespace. We did not perform any normalization beyond converting everything to lower 
case. This gives a much more reasonable list of the most frequent words (seen in the following 
table). 
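
Scoring bigrams by PMI can be sketched as follows. The function name `pmi_bigrams`, the `min_count` cutoff, and the toy token sequence are hypothetical; probabilities are simple relative frequencies, with PMI defined as $\log[p(x, y) / (p(x)\,p(y))]$.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log[p(x, y) / (p(x) p(y))]."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # PMI estimates are unreliable for rare pairs
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("princess ida sang and king gama sang and "
          "princess ida left and king gama left").split()
ranked = pmi_bigrams(tokens)
```

Pairs whose words rarely occur apart, such as “princess ida” and “king gama” in this toy sequence, score higher than pairs involving a frequent word like “and”; high-scoring pairs can then be joined into single tokens such as “princess_ida”.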

Note, however, that there are still some problems: “opera” and “operas” are still distinct, “d’oyly 
carte” was turned into “oyly carte”, and “edit” (a Wikipedia-specific stopword) is still present. If 
we believe that these were problematic (or if we saw such issues in the output), we could apply a 

6 In practice, one should use the Wikipedia XML dump, which would avoid some of these issues; again, we’re using the 
HTML version to give examples of some of the issues that might arise with web corpora. 
stemming algorithm that would strip terminal “s” on plurals (at the risk of diminishing 
interpretability), improve tokenization (at the risk of allowing spurious punctuation to enter words), 
or add to our stop list (at the risk of removing real content-bearing contributions to documents). 

At this point, it’s often helpful to look at the most frequent words summed over all documents. 
This often gives you an idea of where problems might lie. If the results look reasonable, then you 
can press ahead with inference. 

12.1.3 Running Topic Models 

There are many different implementations of topic modeling software available; 7 each has (or 
should have!) its own discussion of how to specifically run the models and prepare input. The goal 
of this section is not to describe how to run any particular implementation but to talk about what 
needs to happen to go from raw data to an inferred topic model. 

Broadly, implementations fall into two general categories: those that use variational inference 
(Blei et al., 2003) and those that use Gibbs sampling (Griffiths and Steyvers, 2004). While 
describing these techniques is outside the scope of this chapter, they both attempt to discover the 
latent variables that best explain a dataset. 

Preparing Data 

After completing the steps in Section 12.1.2, the data must be converted into a form that is efficiently 
readable by software. This takes two steps: selecting the vocabulary and representing the data. 

Selecting the vocabulary is typically done by mapping strings to integers (e.g., “opera” is 0, 
“princess_ida” is 1). You usually do not want to create an integer for every unique string that 
appears as a type in your corpus: doing so increases the memory and time needed to run inference 
and can also introduce errors from misspellings or faulty tokenization. Because word frequencies in 
natural language follow a power-law distribution, many types appear in only a handful of documents 
(or one). Including such types is useless for topic models, which attempt to generalize across documents. 
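
Vocabulary selection with a document-frequency cutoff might look like the sketch below. The helper name `build_vocab` and the cutoff `min_df=2` are assumptions for illustration; real corpora usually call for a higher cutoff, and sorting alphabetically before assigning ids is just one convenient convention.

```python
from collections import Counter

def build_vocab(docs, min_df=2):
    """Map each type appearing in at least `min_df` documents to an integer id."""
    # Count documents, not tokens, so repeats within a document do not help.
    df = Counter(t for doc in docs for t in set(doc))
    kept = sorted(t for t, n in df.items() if n >= min_df)
    return {t: i for i, t in enumerate(kept)}

docs = [["opera", "princess_ida", "opera"],
        ["opera", "princess_ida", "oprea"],   # "oprea" is a misspelling
        ["trout", "opera"]]
vocab = build_vocab(docs)
```

With these toy documents the result is `{'opera': 0, 'princess_ida': 1}`: the misspelling and the type seen in a single document are both excluded.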

Next, the data are reduced to this integer form. There are two ways to do this: representing 
a document by a single array of integers, with each element in the array corresponding to one 
appearance of a word, or as two paired arrays a and b, where a[i] represents the identity of a word 
and b[i] represents the frequency of the word in a document. The former is more common for 
inference using sampling; the latter is more common for variational inference. 
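
The two representations can be illustrated as follows. The function names are hypothetical, and silently dropping out-of-vocabulary tokens is an assumption about how unknown strings should be handled.

```python
from collections import Counter

def as_token_array(doc, vocab):
    """One integer per token occurrence (common for sampling-based inference)."""
    return [vocab[t] for t in doc if t in vocab]

def as_count_arrays(doc, vocab):
    """Paired arrays: a[i] is a word id and b[i] is its count in the document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    a = sorted(counts)
    b = [counts[w] for w in a]
    return a, b

vocab = {"opera": 0, "princess_ida": 1}
doc = ["opera", "princess_ida", "opera", "oprea"]
tokens = as_token_array(doc, vocab)   # [0, 1, 0]
a, b = as_count_arrays(doc, vocab)    # a = [0, 1], b = [2, 1]
```

The token array preserves one slot per word occurrence, which a sampler needs in order to store a topic assignment for each token; the paired arrays are more compact when words repeat.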

7 For most uses, we suggest Mallet, http://mallet.cs.umass.edu. For particularly large datasets, we suggest Yahoo 
LDA (Narayanamurthy, 2011) or Mr. LDA (Zhai et al., 2012). 
FIGURE 12.1 

Training likelihood for variational inference. The shape of the curve shows that inference is 
increasing likelihood and is nearly converged. Other inference methods may have different 
convergence profiles, but they should have a similar shape. 


Initialization 

Both variational inference and Gibbs sampling can be viewed as a search over latent variables. 
Variational inference searches over variational parameters that induce a distribution over a model’s 
latent variables, and states in the Markov chain for Gibbs sampling are direct assignments of latent 
variables. Thus, in either case, models must be initialized. 

The most important aspect of initialization is to avoid local minima. Some initializations are 
‘good enough’ so that inference will not want to leave the initial state. One common example of this 
is initializing the variational distributions as uniform distributions; this is a local optimum and will 
not allow inference to improve upon the initialized state (with boring, identical, uniform topics). 

A better approach is to initialize randomly. In practice, this results in either perturbing the initial 
variational distributions from uniform slightly or, in a Gibbs sampler, setting topic assignments 
uniformly at random. 

Another approach is to initialize the state in a way that might give your algorithm a boost to 
speed convergence. For example, one could initialize a topic model by initializing each topic with 
a single document. For other models, other initializations are also possible, but it is important to 
be aware of the possibility of falling into a local minimum. If inference is working correctly, your 
model should not be that sensitive to initialization. 

Regardless of how you initialize your model and regardless of what inference technique you 
use, it’s important to run inference from multiple starting points. This guards against problems 
of local optima and allows you to make better estimates about the stability of your inferred latent 
variables. 
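
Random initialization from several starting points could be sketched as below. The function names, the `noise` scale, and the toy sizes are assumptions; an actual implementation would feed each starting point into its own inference run and compare the results.

```python
import numpy as np

def init_gibbs(doc_lengths, T, rng):
    """Assign every token a topic uniformly at random."""
    return [rng.integers(0, T, size=n) for n in doc_lengths]

def init_variational(T, W, rng, noise=0.01):
    """Perturb uniform topic-word distributions slightly, then renormalize."""
    lam = np.full((T, W), 1.0 / W) + noise * rng.random((T, W)) / W
    return lam / lam.sum(axis=1, keepdims=True)

# One generator per seed gives one starting point per inference run.
starts = [init_variational(T=4, W=10, rng=np.random.default_rng(seed))
          for seed in range(3)]
z = init_gibbs([5, 8], T=4, rng=np.random.default_rng(0))
```

The small perturbation is what breaks the symmetry of the exactly uniform start described above, so that topics can drift apart during inference.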

Inference 

Running inference itself is the most important step in the process; it produces a learned model from 
raw data. If you’ve implemented inference yourself, it is also likely that this aspect has taken the 
most time. 

Typically, implementations work based on a series of iterations. Each iteration slightly updates 
the state of the algorithm, working slowly toward a local optimum. With each iteration, the 
model should estimate the data likelihood, i.e., given the current guess of the latent variables, what 
is the probability of recreating the data? 

You should watch this quantity closely. If the quantity is consistently going down, it probably 
means you have a bug. If the quantity is improving steadily, it is a good sign that inference is making 
progress (although there could be other problems lurking underneath). It is difficult to say how many 
iterations are needed for inference; it depends on initialization, the data size, the complexity of the 
model, and what form of inference you’re using. However, once the likelihood converges to a value, 
it is usually a sign that your inference has converged (although this is not always a sure-fire indicator, 
particularly for MCMC (Neal, 1993)). 

Ready-made implementations should provide this information to you; even if you trust the code, 
you should still pay attention to verify that inference is progressing as it should. 
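
A simple check on the per-iteration log likelihood might look like the following sketch. The function name `check_progress`, the tolerance, and the window size are all hypothetical; for Gibbs sampling the trace fluctuates from iteration to iteration, so a rule like this is a rough guide rather than a convergence test.

```python
def check_progress(log_likelihoods, tol=1e-4, window=3):
    """Classify a per-iteration log-likelihood trace.

    Returns "decreasing" if the last `window` steps all went down,
    "converged" if the last relative change is below `tol`,
    and "improving" otherwise.
    """
    ll = list(log_likelihoods)
    if len(ll) < 2:
        return "improving"
    steps = [b - a for a, b in zip(ll, ll[1:])]
    recent = steps[-window:]
    if len(recent) == window and all(s < 0 for s in recent):
        return "decreasing"  # consistently going down: probably a bug
    rel_change = abs(steps[-1]) / max(abs(ll[-2]), 1e-12)
    return "converged" if rel_change < tol else "improving"

status = check_progress([-1000.0, -900.0, -850.0])
```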

Hyperparameter Updating 

Hyperparameters in topic models are those that are not latent variables in the model but instead are 
the ‘most basic’ parameters of the topic model. Typically, these are the Dirichlet parameters that are 
assumed to have generated the per-document topic distributions and the per-topic distributions over 
words. More generally, these are any unknown parameters that govern latent variables (and are a 
part of any statistical model, not just topic models). 

Particularly if you’ve derived inference for the model yourself, it’s very tempting to set hyperpa- 
rameters and forget them. After all, you’re getting good results, the models are learning interesting 
things, and you’ve proved your point. At the risk of editorializing, we would encourage authors to 
explore sampling hyperparameters: 

• It is not that hard, both from the programmer’s perspective and from the amount of time it takes 
the computer; 

• If you’re using any kind of perplexity or likelihood-based evaluation, you will almost certainly 
lose to anything that does hyperparameter optimization (Wallach et al., 2009a); and 

• It will improve the (qualitative) quality of the results. 


12.1.4 Evaluation of Topic Models 

One of the most important features of topic modeling is that it does not require ‘supervision’ in 
the form of annotations. In addition to text documents, many text mining and NLP tasks require 
additional information such as document-level labels for classification, word-level labels for part-of- 
speech tagging, phrase-structure trees for parsing, and relevance judgments for information retrieval. 
With the exception of classification and translation, document creators do not naturally produce such 
labels, and hiring experts to add annotations can be expensive and time-consuming. In contrast, topic 
models require only a segmentation of documents into word tokens. They can therefore be applied 
quickly to large volumes of data. 

The benefit of supervised models, however, is that if we take the human-generated labels as a 
gold standard, measuring and comparing the performance of different methods is simple: we hold 
out a section of the labeled data as a testing set, train a model on the remaining data, and ask that 
model to predict labels for the testing set. If the predicted labels match the ‘true’ labels, the model is 
effectively learning the association between input data and output labels. In topic modeling, where 
the model is not trained to predict specific topics, there is no supervised gold standard. 

Finding patterns in data is the central goal of topic modeling, but in order to make scientific 
statements, we must also be able to make predictions about future observations. As an alternative 
to predicting annotations given previously unseen documents, we can attempt to predict the unseen 
documents themselves. Simply generating documents and comparing them to a held-out set, how- 
ever, is not feasible. In classification, there are a finite number of possible document labels. For a 
given testing document, even random guessing has a reasonably good probability of selecting the 
correct label. In contrast, the number of possible sequences of words from a vocabulary is exponen- 
tial in the length of the document. Therefore, rather than measuring accuracy or some rank-based 
metric, we calculate the marginal probability of the held-out documents under the model. This 
metric measures the degree to which the model concentrates its probability mass on a relatively small 
set of ‘sensible’ documents rather than the vastly larger set of completely random documents. 

If, given some held-out document set $\mathbf{w}$, some model $A$ assigns greater marginal probability 
$p(\mathbf{w} \mid A)$ than some model $B$, we assume that model $A$ has more effectively learned the language of 
the document set than model $B$. Model $A$ is, in some sense, less ‘surprised’ by the real documents 
than model $B$. Borrowing a term from statistical language modeling, we refer to the negative log 
probability of the held-out set divided by the number of tokens, $-\log p(\mathbf{w} \mid A) / |\mathbf{w}|$, as the perplexity 
of model $A$. 
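
As a small worked example, the per-token quantity can be computed from held-out log probabilities as sketched below. The numbers are made up purely for illustration, and the helper name `per_token_neg_log_prob` is hypothetical. Note that in the language modeling literature the term perplexity usually refers to the exponential of this quantity; both values are computed here.

```python
import math

def per_token_neg_log_prob(log_probs, n_tokens):
    """-log p(w | A) / |w| from per-document held-out log probabilities."""
    return -sum(log_probs) / n_tokens

# Hypothetical held-out log probabilities (natural log) for two documents
# totalling 150 tokens, under two competing models.
model_a = per_token_neg_log_prob([-700.0, -350.0], n_tokens=150)
model_b = per_token_neg_log_prob([-720.0, -380.0], n_tokens=150)

# Exponentiated form, as commonly reported in language modeling.
exp_a, exp_b = math.exp(model_a), math.exp(model_b)
```

Lower is better under either form, so in this made-up comparison model A would be preferred.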

Unfortunately, even measuring the marginal probability of a document under a topic model is not 
computationally tractable due to the exponentially large number of possible topic assignments for 
words. Good approximations, however, can be evaluated tractably (Wallach et al., 2009b; Buntine, 
2009). 

Although measurements of held-out probability are important, they are not, by themselves, suf- 
ficient. There are several common problems: 

• People use topic models to summarize the semantic components of a large document collec- 
tion, but good predictive power does not necessarily mean that a model provides a meaningful 
representation of concepts. 

• Users frequently distinguish between the quality of different topics: some are seen as coherent 
or pure, while others are seen as random or illogical. Marginal probability, however, depends on 
all topics, and therefore cannot be easily decomposed as a function of individual topics. 

• Calculations of marginal probability can be sensitive to hyperparameter settings. 


12.2 Problems 

The topic model is based on the simple assumption that documents contain multiple topics. But is 
this assumption valid? An article on salary caps in the NFL may be about sports and remuneration, 
but do those two topics account for every word written in that article? And is the bag-of-words 
assumption (that word order is irrelevant) valid? In topic models, every word in a document is 
probabilistically assigned a topic label, and therefore topics need to explain or account for all words 
that appear. Is this a reasonable assumption? 

Topic models are based on a generative model that clearly does not match the way humans write. 
However, topic models are often able to learn meaningful and sensible models. Of course, models 
are learned from the data — a collection of documents — so the quality of the model depends on the 
quality of the training data. 

Most evaluation of topic models has focused on statistical measures of perplexity or likelihood 
of test data. But this type of evaluation has limitations. The perplexity measure does not reflect the 
semantic coherence of individual topics learned by a topic model, nor does perplexity necessarily 
indicate how well a topic model will perform in some end-user task. Recent research has shown 
potential issues with perplexity as a measure: Chang et al. (2009b) suggest that human judgments 
can be contrary to perplexity measures. 

With this in mind, we pose the following overarching questions relating to evaluating topic 
models: 

Q1 Are individual topics meaningful, interpretable, coherent, and useful? 

Q2 Are assignments of topics to documents meaningful, appropriate, and useful? 

Q3 Do topics facilitate better or more efficient document search, navigation, understanding, and 
browsing? 

While the final question is ultimately the most important for assessing the end-user utility of 
topic models, it is appropriate to address these questions in order. It doesn’t make sense to talk 
about the quality of assignments of topics to documents if one can’t agree on what a topic is about. 
Although topics themselves are not the end goal (the end goal is to use topics to improve some end- 
user task), the evaluation framework is built on the usability and usefulness of individual topics, and 
our focus in this chapter is primarily on the first of the three questions. 

12.2.1 Categories of Poor Quality Topics 

Before considering bad topics, it is helpful to consider what we are looking for in a topic. The 
following topic has several good, though not essential, properties: 


trout fish fly fishing water angler stream rod flies salmon... 


It is specific. There is a clear focus on words related to the sport of trout fishing. It is coherent. 
All of the words are likely to appear near one another in a document. Some words (water, fly) are 
ambiguous and may occur in other contexts, but they are appropriate for this context. It is concrete. 
We can picture the angler with his rod catching trout in the stream. It is informative. Someone 
unfamiliar with the topic can work from general words (fishing) to learn about more unfamiliar 
words (angler). Relationships between entities can be inferred (trout and salmon both live in streams 
and can be caught in similar ways). 

There are a variety of ways topics can be “bad,” and we list some of them here. This value 
judgement is contextual: “good” or “bad” depends on a variety of factors that may involve the task, 
user, experience, etc. Here we take “bad” as some general idea of lack of usability, usefulness, 
utility, etc. 

General and Specific Words 

In any natural language, the most frequent words have less specific meaning, while rare words 
have very precise meanings. Stopwords such as the, and, of are the most extreme examples, but 
this gradient in specificity remains even after removing such words. For example, in a collection 
of publications from an artificial intelligence conference, words in the 99th percentile by token 
frequency might include algorithm, model, estimation. At the opposite end, there are large numbers 
of words that occur only once or twice, such as dopaminergic and phytoplankton. 


notion sense choice situation idea natural explicitly explicit definition refer... 


level significantly_higher significantly_lower lower higher_level measured significantly 
different investigate differ tended positive correlation significantly increased... 


might doesn’t fact anyone does isn’t mean anyway point quite... 


quite rather couple wasn’t far seems less three however point... 


Topic models often contain one or more topics consisting of frequent, non-specific words. Users 
perceive these topics as overly general and therefore not useful in understanding the divisions within 
a corpus. Such topics often consist of the most frequent words that were not removed as part of the 
stoplist. 

Low-frequency words can also be problematic. Topics that contain many specific words are often 
perceived as unhelpful because they do not provide a general overview of the corpus. Such topics 
are also more vulnerable to random chance than topics containing more frequent words because 
they rely on words with small sample sizes. 

Mixed and Chained Topics 

Many topics are perceived as low quality by users because they are “mixed” or “chained.” 


zinc migraine veterans zn headache magnesium military war zn2 csd affairs episodic 
deficiency... 


A mixed topic can be defined as a set of words $T = \{w_1, w_2, w_3, \ldots, w_n\}$ that do not make sense 
together, but that contains subsets $S_1, S_2, \ldots$, each of which individually forms a sensible combination 
of words. For example, the words 


dog, cat, bird, honda, chevrolet, bmw 


do not make sense together, but dog, cat, bird describe animals and honda, chevrolet, bmw describe 
makes of cars. 

A chained topic is like a mixed topic: a set of words that is low quality overall but contains high 
quality subsets. The difference is that in a chained topic every high quality subset shares at least one 
word with another subset. For example, the set 


reagan, roosevelt, clinton, lincoln, honda, chevrolet, bmw 


combines the names of U.S. presidents with makes of cars, but lincoln can belong to both categories. 
Chained topics can be caused by ambiguous words such as lincoln, but can also result from 
hierarchical relationships. A broader concept like tax may include several narrower concepts (sales 
tax, property tax). These more specific individual words (sales, property) may by themselves form 
nonsensical combinations. 
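
The distinction between mixed and chained topics can be made concrete with a small sketch. It assumes the coherent subsets have already been identified (in practice by a human reader); the function name `classify_topic` is hypothetical and is not part of any topic modeling library.

```python
from itertools import combinations

def classify_topic(subsets):
    """Label a low-quality topic from its (already known) coherent subsets.

    Following the definitions in the text: the topic is "chained" when every
    coherent subset shares at least one word with another subset, and
    "mixed" otherwise.
    """
    subsets = [set(s) for s in subsets]
    shares_word = [False] * len(subsets)
    for (i, a), (j, b) in combinations(enumerate(subsets), 2):
        if a & b:  # the two subsets overlap in at least one word
            shares_word[i] = shares_word[j] = True
    return "chained" if subsets and all(shares_word) else "mixed"

animals_cars = [{"dog", "cat", "bird"}, {"honda", "chevrolet", "bmw"}]
presidents_cars = [{"reagan", "roosevelt", "clinton", "lincoln"},
                   {"lincoln", "honda", "chevrolet", "bmw"}]

print(classify_topic(animals_cars))
print(classify_topic(presidents_cars))
```

Here the word lincoln is what links the two subsets of the second example.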

Identical Topics 

One common problem with the topic models learned on corpora is that the topics all look the same 
(or nearly so). Since topic models are meant to explain a corpus, having identical topics is clearly a 
suboptimal outcome. We discuss some of the possible causes of this outcome and how you can fix 
them. 


company customer market product business revenue companies software... 


market product company sale patent companies commercial cost... 


One reason that topics might appear to be identical is that the prior topic distribution is being 
observed. Normally, the prior distribution is combined with data to produce a posterior conditioned 
on that data. However, the prior is still a model of text even without data, and most implementations 
will happily provide the prior distribution as the “result,” even if it has not been supplied with data. 

This result might be of particular concern if the inference took a suspiciously short amount of 
time or if inference chose not to use some of the topics available to it. Both problems are relatively 
easy to fix: perhaps preprocessing created empty documents or too many topics were chosen. 

Incomplete Stopword List 

In contrast, one of the symptoms of an incomplete stopword list is topics filled with highly frequent 
words (but the topics are not identical). Often, the topics discovered are perfectly reasonable, but 
buried underneath the convention of displaying the n most probable words in a topic. 


vii viii xiv xiii xii xvi xix xviii xvii xxix xxx xxi xxii xxiv xiii... 


david nick elizabeth brad kelsey ted drew theresa ricky russell... 


This is often resolved by adding the most frequent words in the topics to the stopword list and 
then rerunning inference. Alternatively, one could adopt models that have asymmetric priors 
(Wallach et al., 2009a) or explicitly model syntax (Griffiths et al., 2005; Boyd-Graber and Blei, 2008). 
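
One possible way to pick candidate stopwords is sketched below. It is an assumption of this sketch, not a rule from the text, that a word is proposed when it appears among the top words of at least `min_topics` topics; the function name `propose_stopwords` and the example topics are hypothetical.

```python
from collections import Counter

def propose_stopwords(topics_top_words, min_topics=3):
    """Return words that appear in the top-word lists of many topics."""
    # Count each word once per topic, no matter how often it is listed there.
    counts = Counter(w for top in topics_top_words for w in set(top))
    return {w for w, c in counts.items() if c >= min_topics}

topics_top_words = [
    ["said", "market", "company"],
    ["said", "game", "team"],
    ["said", "court", "law"],
    ["market", "stock", "said"],
]
extra_stopwords = propose_stopwords(topics_top_words)
print(extra_stopwords)
```

The proposed words would then be reviewed by hand, added to the stopword list, and inference rerun.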


Nonsensical Topics 

Another possible problem is that the topics learned will be distinct, but otherwise inscrutable. This 
is often the result of preprocessing errors or providing the model with too much information. 


tree plum ink blossom chp branch bird paper... 


Remember that topic models discover words that often appear together in documents. If your 
“documents” evince a structure that has similar correlation patterns between “words,” the model will gladly 
create a topic (we use scare quotes to highlight that the determination of what counts as a document and 
a word is often subjective and is often affected by preprocessing steps). 

For example, if some documents are created by optical character recognition (OCR), frequent 
OCR errors will likely occur together; this can create a topic of such errors. Similarly, if metadata 
are included in the specification of a document, this also might create topics to model this boilerplate 
material (e.g., as we did in Section 12.1.2). 


12.3 Diagnostics 

Now we have topics, but how do we know how good the topics are? Traditionally in the literature, 
measurements have focused on held-out likelihood (Blei et al., 2003; Blei and 
Lafferty, 2005) or an external task that is independent of the topic space such as sentiment detection 
(Titov and McDonald, 2008) or information retrieval (Wei and Croft, 2006). This is true even 
for models engineered to have semantically coherent topics (Boyd-Graber et al., 2007). 

For models that use held-out likelihood, Wallach et al. (2009b) provide a summary of evaluation 
techniques. These metrics borrow tools from the language modeling community to measure 
how well the information learned from a corpus applies to unseen documents. These metrics 
generalize easily and allow for likelihood-based comparisons of different models or selection of model 
parameters such as the number of topics. However, this adaptability comes at a cost: these methods 
only measure the probability of observations; the internal representation of the models is ignored. 

However, not measuring the internal representation of topic models is at odds with their presentation 
and development. Most topic modeling papers display qualitative assessments of the inferred 
topics or simply assert that topics are semantically meaningful, and practitioners use topics for 
model-checking during the development process. Hall et al. (2008), for example, used latent topics 
deemed historically relevant to explore themes in the scientific literature. Even in production 
environments, topics are presented as themes: Rexa, 8 a scholarly publication search engine, displays the 
topics associated with documents. 

In this section, we focus on metrics that do pay attention to the underlying topics either by asking 
individuals directly or by measuring the properties of the discovered topics. 

12.3.1 Human Evaluation of Topics 

Chang et al. (2009b) presented the following task to evaluate the latent space of topic models. In the 
word intrusion task, the subject is presented with six randomly ordered words. The task of the user 
is to find the word which is out of place or does not belong with the others, i.e., the intruder. 

When the set of words minus the intruder makes sense together, then the subject should easily 
identify the intruder. For example, most people readily identify apple as the intruding word in the 
set: dog, cat, horse, apple, pig, cow because the remaining words: dog, cat, horse, pig, cow make 
sense together — they are all animals. For the set: car, teacher, platypus, agile, blue, Zaire, which 
lacks such coherence, identifying the intruder is difficult. People will typically choose an intruder 
at random, implying a topic with poor coherence. 

In order to construct a set to present to the subject, they select a topic from the model. They then 
select the five most probable words from that topic. In addition to these words, an intruder word 
is selected at random from a pool of words with low probability in the current topic (to reduce the 
possibility that the intruder comes from the same semantic group) but high probability in some other 
topic (to ensure that the intruder is not rejected outright due solely to rarity). All six words are then 
shuffled and presented to the subject. 
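
The construction just described can be sketched as follows. This is a minimal illustration, not the code used by Chang et al. (2009b): it assumes "low probability in the current topic" means the lower half of that topic's ranking and "high probability in some other topic" means the other topic's top five words. The function name `build_intrusion_set` and the toy topics are hypothetical.

```python
import random

def build_intrusion_set(topics, k, n_top=5, rng=None):
    """Build one word intrusion question for topic k.

    topics: list of dicts mapping word -> probability, one dict per topic.
    Returns the shuffled words shown to the subject and the true intruder.
    """
    rng = rng or random.Random()
    ranked = sorted(topics[k], key=topics[k].get, reverse=True)
    top_words = ranked[:n_top]
    # Assumption: "low probability" = lower half of the ranking for topic k.
    low_in_k = set(ranked[len(ranked) // 2:]) - set(top_words)
    candidates = set()
    for j, topic in enumerate(topics):
        if j != k:
            # Assumption: "high probability elsewhere" = top words of topic j.
            top_j = sorted(topic, key=topic.get, reverse=True)[:n_top]
            candidates.update(w for w in top_j if w in low_in_k)
    if not candidates:
        raise ValueError("no suitable intruder word found")
    intruder = rng.choice(sorted(candidates))
    shown = top_words + [intruder]
    rng.shuffle(shown)
    return shown, intruder

animals = ["dog", "cat", "horse", "pig", "cow"]
fruits = ["apple", "pear", "plum", "grape", "fig"]
topics = [
    {**{w: 0.19 for w in animals}, **{w: 0.01 for w in fruits}},
    {**{w: 0.01 for w in animals}, **{w: 0.19 for w in fruits}},
]
shown, intruder = build_intrusion_set(topics, 0, rng=random.Random(0))
print(shown, intruder)
```

Model precision for a topic is then the fraction of subjects who pick the true intruder.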

What Topics Make Sense? 

The word intrusion task was applied to two corpora: The New York Times (Sandhaus, 2008) and 
Wikipedia, 9 two real-world corpora that are viewed by millions of people each day. Figure 12.2 
shows the spectrum from incoherent to coherent topics. 

An additional finding was that there was not a clear association between traditional measures of 
topic models, such as held-out log-likelihood, and more intuitive measures such as the word intrusion 
task. 

12.3.2 Topic Diagnostic Metrics 

While the techniques described in the previous section are useful, they are time consuming and rel- 
atively expensive. Are there ways to measure topic quality without relying on human judgments? 
Fortunately, there are several useful topic diagnostic metrics that depend only on statistics of in- 
dividual words in a topic without considering relationships between words or external knowledge 
sources. None of these metrics is conclusive by itself, but taken together they can provide a useful 
automated summary of topic quality. As a running example, we consider a model trained with 100 
topics on a corpus of political blogs from the 2008 U.S. presidential election. 

Topic Size 

Most topic model inference methods work by assigning the word tokens in a corpus to one of K 
topics. We can add up the number of tokens or fractions of a token assigned to a given topic to get 
a measure of topic size, where the unit is the number of word tokens. There is a strong relation 
between this measure of topic size and perceived topic quality: very small topics are frequently 

8 See http://rexa.info. 

9 See www.wikipedia.org. 
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FIGURE 12.2 

A histogram of the model precisions (the proportion of times users found the “intruder”) on the New 
York Times corpus evaluated on a 50-topic LDA model. Example topics are shown for several bins; 
the topics in bins with higher model precision evince a more coherent theme. 


bad (Talley et al., 2011). As an example, in the 2008 political blog model, the smallest topic by 
token count is http player video window flag script false scriptalreadyrequested www, with around 
6,500 words (most topics lie between 15,000 and 20,000). This topic appears to represent URLs for 
embedded videos. Although it is arguably interpretable, it is not the sort of conceptual topic that 
many users may be looking for. 

There are several possible explanations for this relationship. The most common topics in a corpus 
are usually well-represented in many documents. For the less-frequent topics, the model must 
estimate their word distribution from a smaller sample size. Smaller topics are also more vulnerable 
to becoming mixed with other topics because they do not ‘own’ their distribution as well. 
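
Topic size is simple to compute from the state of a sampler. The sketch below assumes hard assignments (one topic id per token), as in Gibbs sampling; with fractional assignments one would sum the fractions instead. The function name `topic_sizes` and the toy data are hypothetical.

```python
from collections import Counter

def topic_sizes(assignments):
    """Count tokens per topic.

    assignments: one list per document, holding the topic id of each token.
    """
    sizes = Counter()
    for doc in assignments:
        sizes.update(doc)
    return sizes

assignments = [[0, 0, 1], [1, 2, 1, 1], [0, 1]]
sizes = topic_sizes(assignments)
smallest = min(sizes, key=sizes.get)
print(sizes, smallest)
```

Sorting topics by this count and inspecting the smallest ones is a quick first check.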

Word Length 

This metric measures the average length of the top N words in a topic. The usefulness of this metric 
varies by corpus, but in many cases it can be useful in picking up anomalous topics. The intuition is 
that words with more specific meaning tend to be longer, and vice versa. Examples include topics 
consisting of stopwords from a language other than the primary language of the corpus, and topics 
with many short acronyms, which are frequently ambiguous. In the political blog corpus, the topics 
with the smallest average word length are legislator usmc aye nc nyfl pa oh ca tx va (2.7 characters) 
and re ll exit don doesn ve isn didn maverick guy (4.15 characters). The legislator topic appears 
to represent abbreviations for U.S. states, perhaps related to legislative roll call voting. As with the 
previous metric, word length in this case does not necessarily indicate that a topic is uninterpretable, 
but it flags the fact that this is a different sort of cluster of co-occurring words. The re topic is more 
problematic, and indicates that there may be problems with tokenization of contractions such as 
you’re or don’t, possibly due to differences in character encodings for the apostrophe. 
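
A minimal sketch of the word length metric follows. The cutoff of 3.0 characters used to flag topics is an arbitrary assumption for illustration, as are the function name `mean_top_word_length` and the second example topic.

```python
def mean_top_word_length(top_words):
    """Average number of characters in a topic's top words."""
    return sum(len(w) for w in top_words) / len(top_words)

topics = {
    "fishing": "trout fish fly fishing water angler stream rod flies salmon".split(),
    "states": "nc pa oh ca tx va".split(),
}
lengths = {name: mean_top_word_length(words) for name, words in topics.items()}
# Assumed cutoff: flag topics whose top words average under 3 characters.
flagged = [name for name, length in lengths.items() if length < 3.0]
print(lengths, flagged)
```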

Distance from a Corpus Distribution 

A topic is a probability distribution over the vocabulary of a corpus. We can define a “global” topic 
by counting the number of times each word is used in all documents and normalizing those counts. 
Topic distributions that are similar to this corpus-level distribution according to some measure of 
similarity between distributions, such as Jensen-Shannon distance or Hellinger distance, consist of 
the most common words in a corpus. These topics are often perceived as useless or overly general 
(AlSumait et al., 2009). The most common non-stopwords need to be assigned somewhere, 
so having a small number of these overly general topics may help to improve the quality of other 
topics, but it may not be necessary to display them to users. 

Distance from the corpus distribution is most useful for documents that contain formulaic or 
administrative language, such as grant proposals. In corpora focused on a particular issue, this metric 
may be less useful. The most frequent words in the 2008 political blog corpus are iraq war country 
states military security, indicating that the corpus is dominated by discussion of the Iraq war. The 
closest topic to this overall distribution is iraq troops war surge iraqi withdrawal security petraeus 
military forces, which is a useful, coherent topic. 
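
The two distances named above can be computed with the standard library alone. The sketch below uses base-2 logarithms so that the Jensen-Shannon distance lies between 0 and 1; the toy vocabulary of four words and the topic names are hypothetical.

```python
import math

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two distributions over the same vocabulary."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)) / 2)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logarithm)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    # max() guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

# "Global" topic: normalized corpus-wide word counts.
corpus = normalize([50, 30, 10, 10])
topics = {
    "general": [0.45, 0.35, 0.10, 0.10],
    "specific": [0.02, 0.03, 0.05, 0.90],
}
ranked = sorted(topics, key=lambda t: jensen_shannon_distance(topics[t], corpus))
print(ranked)
```

Topics at the front of `ranked` are the ones closest to the corpus distribution and therefore candidates for being overly general.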

Difference between Token and Document Frequencies 

We typically rank words within a topic by the number of word tokens (or fractional tokens) of a 
particular type that have been observed in the topic. We can also rank words within a topic by the 
number of documents that contain at least one token of a particular type in that topic. The difference 
between the token-based distribution over words and this document-based distribution is useful 
in identifying words that are prominent in a topic due to the burstiness of words. When a corpus 
contains many long documents, it is common for a word that is specific to a single document to 
occur often enough in that one document that it appears in the list of N top words for a topic. The 
highest ranking topic according to this metric is the re topic mentioned previously, where the tokens 
re and ll are the most bursty, possibly reflecting occasional use of second-person pronouns. The 
metric can also detect outlier words in otherwise more usable topics. The second most bursty topic 
is financial crisis bailout fannie mortgage loans wall banks, where the term fannie (referring to the 
U.S. financial entity known as Fannie Mae) is the most bursty. 
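
The two rankings can be computed side by side as sketched below. The text does not fix a particular formula for the difference, so this sketch assumes a simple per-word proxy: the average number of tokens per document that contains the word in the topic. The function names and the toy data are hypothetical.

```python
from collections import Counter

def token_and_doc_counts(docs, k):
    """Count tokens, and documents with at least one token, per word in topic k.

    docs: list of documents, each a list of (word, topic) pairs.
    """
    token_counts, doc_counts = Counter(), Counter()
    for doc in docs:
        words_in_k = [w for w, z in doc if z == k]
        token_counts.update(words_in_k)
        doc_counts.update(set(words_in_k))  # each document counts once per word
    return token_counts, doc_counts

def tokens_per_document(docs, k):
    """Assumed burstiness proxy: tokens of a word per document containing it."""
    token_counts, doc_counts = token_and_doc_counts(docs, k)
    return {w: token_counts[w] / doc_counts[w] for w in token_counts}

docs = [
    [("fannie", 0)] * 6 + [("bank", 0)],
    [("bank", 0), ("loan", 0)],
    [("bank", 0), ("loan", 0), ("vote", 1)],
]
scores = tokens_per_document(docs, 0)
most_bursty = max(scores, key=scores.get)
print(scores, most_bursty)
```

A word with a high score owes its rank in the topic to a few documents rather than to broad use.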

Prominence within Documents 

Topics often represent the major themes of a document, but they can also be clusters of 
“methodological” words, like words describing measurement (larger, smaller, fast) or days of the week. A 
good method for distinguishing between important topics and these more functional topics is to 
examine the proportion of documents assigned to a topic. The names of months may occur many 
times in a corpus, and more consistently with each other than any other words, but no documents 
are dominated by month names in the way that a document might be about molecular biology or 
a political debate. This property can be defined mathematically in several ways. One method is to 
count the number of documents such that the estimated probability of topic k, $\theta_k$, is above some 
threshold, such as 0.2. Another is to count the number of documents for which topic k is the single 
largest topic. For example, the topic meeting official officials conference visit senior reported 
event friday is relatively large, with over 42,000 tokens, but it never appears as the single largest 
topic in any document. Meetings and conferences occur frequently, but are not by themselves worth 
discussing in great depth. In contrast, the topic franken coleman ballots minnesota votes recount al 
board counted has only a quarter of the total tokens of the meeting topic, but is the largest topic 
in 12.5% of the documents it appears in. This topic, about a contested senate election, refers to a 
specific event that is discussed in depth when it is discussed at all. 
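
Both definitions of prominence can be computed directly from the estimated document-topic proportions, as in the sketch below. The function name `prominence` and the three toy documents are hypothetical.

```python
def prominence(doc_topics, k, threshold=0.2):
    """Two prominence counts for topic k.

    doc_topics: one list of topic proportions per document.
    Returns (documents where topic k exceeds the threshold,
             documents where topic k is the single largest topic).
    """
    above = sum(1 for theta in doc_topics if theta[k] > threshold)
    largest = sum(
        1 for theta in doc_topics
        if max(range(len(theta)), key=theta.__getitem__) == k
    )
    return above, largest

doc_topics = [
    [0.70, 0.30, 0.00],
    [0.10, 0.25, 0.65],
    [0.50, 0.25, 0.25],
]
print(prominence(doc_topics, 0))
print(prominence(doc_topics, 1))
```

In this toy data topic 1 passes the threshold in every document yet is never the largest topic, the pattern described above for the meeting topic.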

Burstiness 

Many of the problems people observe in topic models are caused by the phenomenon of bursti- 
ness in natural language documents. This property states that within a context, for example a short 
document, there will be a small set of words that are globally rare but locally common. 

Burstiness is related to, but distinct from, well-known power law properties of natural language. 
If we construct a list of all the distinct words in a corpus of documents and record, for each word, 
the number of documents that contain at least one instance of that word, the vast majority of those 
words will be rare, that is, they will occur in very few documents. The most common words, on the 
other hand, will make up roughly one half of the tokens in any given document. This relationship is 
known as Zipf’s law. 

Zipfian dynamics suggest that many of the words in a document will be rare, but burstiness 
describes an additional level of non-uniformity. It is not only likely that many of the tokens in 
a document will be rare, but it is also likely that many of them will be the same rare word. For 
example, assume you know the overall word-frequency statistics of a corpus. You can estimate the 
probability of every distinct word by dividing the number of occurrences of that word by the total 
number of tokens in the corpus C. If you know nothing about a certain document, these corpus-level 
frequencies provide a reasonable estimator of the probability that a randomly selected word 
from that document will be, for example, elephant. For most words this probability will be a small 
number $p(w \mid C) = \epsilon$. Once you have observed a particular word, however, the probability that the 
next word sampled at random from the same document will be of the same type is much larger than 
$\epsilon$. 
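
The gap between the two probabilities can be estimated empirically. The sketch below assumes the second token is drawn without replacement from the same document as the first; the function names and the toy corpus are hypothetical.

```python
def corpus_probability(docs, word):
    """Corpus-level estimate p(w | C): occurrences divided by total tokens."""
    total = sum(len(doc) for doc in docs)
    return sum(doc.count(word) for doc in docs) / total

def repeat_probability(docs, word):
    """Estimate P(second token is `word` | first token is `word`).

    The first token is a uniformly chosen occurrence of `word`; the second
    is drawn without replacement from the same document.
    """
    num = den = 0.0
    for doc in docs:
        n, c = len(doc), doc.count(word)
        if n > 1 and c > 0:
            num += c * (c - 1) / (n - 1)
            den += c
    return num / den if den else 0.0

other = ["the", "bank", "loan", "rate", "market",
         "of", "a", "vote", "poll", "tax"]
docs = [["elephant"] * 4 + other[:6]] + [list(other) for _ in range(4)]

eps = corpus_probability(docs, "elephant")
rep = repeat_probability(docs, "elephant")
print(eps, rep)
```

A repeat probability far above the corpus-level estimate is the signature of a bursty word.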

This phenomenon of burstiness violates the assumptions of a topic model, which assert that if 
we know the topic for a token position in a document, the probability that the word at that position 
is a particular type is independent of the document. When a topic is well represented in a corpus 
and most documents are short, the violations of this assumption may be averaged out. If there are 
long documents, however, the bursty words in those documents may have high prevalence in a topic 
despite not being representative of the central concept of a topic. Similarly, if a topic appears in only 
a few documents, each of which has its own bursty subset of the words that are associated with the 
topic, the topic may appear idiosyncratic or nonsensical. 

When confronted with a bursty corpus, it may be useful to filter your documents so that docu- 
ments are of similar length, perhaps by removing abnormally long documents or by breaking very 
long documents into smaller documents. It may also be worthwhile to consider particularly bursty 
words as stopwords to prevent them from dominating topics. 

12.3.3 Topic Coherence Metrics 

Our goal of answering whether individual topics are interpretable and coherent is partly addressed 
by the human evaluation of topics in Section 12.3.1. But how can we automatically measure topic 
coherence? And can we do this without disturbing the topic by adding intruder words? Earlier work 
presented an unsupervised approach to ranking topic significance and identifying what they call 
“junk” or “insignificant” topics (AlSumait et al., 2009). However, it was unclear to what extent their 
unsupervised approach and objective function agreed with human judgments, as they presented no 
user evaluations. 

Subsequent work demonstrated that it is possible to automatically measure topic coherence with 
near-human accuracy (Newman et al., 2010a;b) using a topic coherence score based on pointwise 
mutual information of pairs of terms taken from topics. In both Newman et al. (2010a) and Newman 
et al. (2010b), 6000 human evaluations are used to show that their coherence score broadly agrees 
with human-judged topic coherence. Similar approaches further confirmed that humans agree with 
word-pair based topic coherence metrics (Mimno et al., 2011). 

Topic coherence metrics are motivated by measuring word association between pairs of words in 
the list of the top-10 most likely topic words (here, top-10 is chosen arbitrarily as the typical number 
of terms displayed to a user; other settings such as top-20 could work equally well). The intuition 
is that a topic will likely be judged as coherent if pairs of words from that topic are associated. 
Devising word association measures is a long-studied problem in computational linguistics. We opt 
for co-occurrence-based metrics that use corpus aggregates of the number of times two words are 
seen in a document. There are two flavors of counting term co-occurrences: either using a sliding 
window of fixed size (e.g., do two terms appear in a window of 20 consecutive words?), or binarized 
at the document level (e.g., does this document contain both these terms?). The former makes the 
metric more biased toward short-range dependencies. 
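
The two flavors of counting can be sketched as follows. Both functions count each word and each unordered word pair at most once per unit (document or window position); that choice, the function names, and the toy documents are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def document_cooccurrence(docs, vocab):
    """Binarized counts: in how many documents do words (and pairs) appear?"""
    single, pair = Counter(), Counter()
    for doc in docs:
        present = sorted(set(doc) & vocab)
        single.update(present)
        pair.update(combinations(present, 2))  # keys are sorted word pairs
    return single, pair

def window_cooccurrence(docs, vocab, width=20):
    """Sliding-window counts: in how many windows do words (and pairs) appear?"""
    single, pair = Counter(), Counter()
    for doc in docs:
        # A document shorter than the window contributes one window.
        for start in range(max(1, len(doc) - width + 1)):
            present = sorted(set(doc[start:start + width]) & vocab)
            single.update(present)
            pair.update(combinations(present, 2))
    return single, pair

docs = [["trout", "fish", "water"], ["fish", "water", "fish"], ["trout", "rod"]]
vocab = {"trout", "fish", "water", "rod"}
single_d, pair_d = document_cooccurrence(docs, vocab)
single_w, pair_w = window_cooccurrence(docs, vocab, width=2)
print(pair_d)
print(pair_w)
```

With a window of width 2, trout and water never co-occur even though they share a document, which illustrates the bias toward short-range dependencies.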

For a final twist, we could either use the training corpus to count term co-occurrences, or we 
could opt for an external corpus to obtain these counts. The former is certainly easier, but one 
may be concerned that the training corpus is not representative — or may be polluted by unusual 
termwise statistics — as may happen in a text collection of blogs or tweets. In this case, the external 
corpus could come from a variety of sources, for example the entire collection of English Wikipedia 
articles. 

Our topic coherence metrics take the form 

$$\text{TC-}f(\mathbf{w}) = \sum_{i<j} f(w_i, w_j), \quad i, j \in \{1, \ldots, 10\}, \tag{12.3}$$

where $\mathbf{w} = \{w_1, w_2, \ldots, w_{10}\}$ are the top-10 most likely terms in a topic, and $f$ is some function 
measuring the association between words $w_i$ and $w_j$. 

Let $N(w_i, w_j)$ be the number of times words $w_i$ and $w_j$ co-appear in a sliding window of fixed 
width (say 20 terms), applied to every document in the corpus used to obtain co-occurrence counts. 
Furthermore, $N(w_i)$ is the total count of times $w_i$ appeared in that sliding window. Let $M(w_i, w_j)$ 
be the number of distinct documents where words $w_i$ and $w_j$ co-appear, and $M(w_i)$ is the total 
number of distinct documents that include term $w_i$. We create different metrics by using $N$ or 
$M$ to convert counts to probabilities, using the appropriate normalization. Two obvious quantities 
are pointwise mutual information (PMI) and log conditional probability (LCP). Note that PMI is 
symmetric, whereas LCP is one-sided. 


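$$\text{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \tag{12.4}$$

$$\text{LCP}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_j)}. \tag{12.5}$$

Using these, we define the following three topic coherence metrics: 

$$\text{TC-PMI}(\mathbf{w}) = \sum_{i<j} \text{PMI}(w_i, w_j), \tag{12.6}$$

$$\text{TC-LCP}(\mathbf{w}) = \sum_{i<j} \text{LCP}(w_i, w_j), \tag{12.7}$$

$$\text{TC-NZ}(\mathbf{w}) = \sum_{i<j} \mathbb{1}[N(w_i, w_j) = 0], \tag{12.8}$$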

where all sums are over $i, j \in \{1, \ldots, 10\}$. Note, we have added a third metric that simply counts 
the number of word pairs that are never observed in the reference corpus. These topic coherence 
metrics can be computed four different ways: using sliding window ($N$) or binarized ($M$) counts, 
obtained from training data or external data. For LCP, we can also compute a symmetric metric instead of 
a one-sided metric by switching $i < j$ for $i \neq j$. When $N(w_i, w_j) = 0$, smoothing is required to 
compute a finite LCP, and for PMI we simply assume independence, PMI = 0. 
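
The three metrics can be computed together from precomputed counts, as in the sketch below. It assumes that probabilities are counts divided by the number of windows (or documents), that LCP is smoothed by adding a constant to the pair count (the text requires smoothing but does not fix a scheme), and that every top word occurs at least once in the reference corpus. The function name `topic_coherence` and the toy counts are hypothetical.

```python
import math
from itertools import combinations

def topic_coherence(top_words, single, pair, total, smooth=1.0):
    """Return (TC-PMI, TC-LCP, TC-NZ) for one topic.

    top_words: topic words in rank order, so combinations() yields i < j.
    single[w]: number of windows (or documents) containing w.
    pair[(a, b)]: number containing both, keyed by the sorted word pair.
    total: number of windows (or documents) in the reference corpus.
    """
    tc_pmi = tc_lcp = 0.0
    tc_nz = 0
    for wi, wj in combinations(top_words, 2):
        n_ij = pair.get(tuple(sorted((wi, wj))), 0)
        if n_ij == 0:
            tc_nz += 1  # pair never observed; its PMI is taken to be 0
        else:
            tc_pmi += math.log(n_ij * total / (single[wi] * single[wj]))
        # Assumed smoothing: add `smooth` to the pair count to keep LCP finite.
        tc_lcp += math.log((n_ij + smooth) / single[wj])
    return tc_pmi, tc_lcp, tc_nz

single = {"fish": 2, "trout": 2, "water": 2, "rod": 1}
pair = {("fish", "trout"): 1, ("fish", "water"): 2,
        ("trout", "water"): 1, ("rod", "trout"): 1}
tc_pmi, tc_lcp, tc_nz = topic_coherence(
    ["trout", "fish", "water", "rod"], single, pair, total=3)
print(tc_pmi, tc_lcp, tc_nz)
```

Passing window counts gives the $N$-based variants and passing document counts gives the $M$-based variants.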

Using TC-PMI computed with a 20-word sliding window on all 3M articles in English 
Wikipedia, Newman et al. (2010a;b) compared computed topic coherence to 6000 human-judged 
coherence scores, and obtained a Spearman rank correlation of $\rho = 0.8$, approximately the same as 
the inter-rater correlation computed on a leave-one-out basis. This topic coherence metric was used 
by Lau et al. (2010) for their best topic word task, and it performed well at detecting Chang et al.’s 
intruder word (Chang et al., 2009b). 

We conclude this section by showing how these three different topic coherence metrics differ. 
Here, we focus on the metrics’ ability to identify poor quality topics. We list sample topics learned 
from a collection of New York Times news articles, showing the lowest-scoring topics using the three 
metrics: 

TC-PMI 

why bad thing maybe doesn something does let isn really... 
self sense often history yet power seems become itself perhaps... 
came went told took later didn room began asked away... 
need better problem must enough does likely less whether... 

TC-LCP 

space canadian station canada nasa mission air shuttle crew hughes... 
fight lewis jones tyson vegas las boxing ring murphy elvis... 
ball body wright arms watson puerto club rico hands swing... 
blood thompson wilson cell test gladwin disease nixon gas sickle... 

TC-NZ 

eminem connor shea hanson mile daniels abbott seymour black trupia... 
porter amin burke olsen omar horse horses martinez ruettgers botai... 
hart hunter troy mack willis oxygen scooter terry chayes farrell... 
greene weber sims fashion fairchild malley fletcher crosby sawyer mccann... 

The above examples show how PMI, LCP, and NZ-based topic coherence metrics identify dif- 
ferent types of poor quality topics. TC-PMI tends to show poor quality topics that include terms 
that are more general and more frequent. TC-LCP shows topics that appear to relate to a nameable 
subject, but nevertheless are relatively incoherent. Finally, TC-NZ appears to do a good job at 
identifying the classic topic-of-names that is often learned by topic models. 


12.4 Improving Topic Models 

Now that we know what problems can appear in topic models and how to detect them, what can 
we do about them? At a high level, the problems can be interpreted as topics containing words that 
should not be together but are (e.g., “mixed” or “chained” topics) or distinct topics that should be 
together but aren’t. 

In this section, we discuss techniques to adapt the statistical formulation of topic models to 
incorporate these intuitive descriptions of problematic topics to create analyses of datasets that are 
more useful and more understandable. We also include a discussion on automatic topic labeling, 
another technique to improve the utility of topic models. 

12.4.1 Interactive Topic Models 

First, let’s begin with a common use case: a frustrated consumer of topic models staring at a collection 
of topics that do not make sense. In this section, we discuss interactive topic modeling (ITM), 
an in situ method for incorporating human knowledge into topic models. 10 

Recall that LDA views topics as distributions over words, and each document expresses an 
admixture of these topics. For “vanilla” LDA, both are drawn from symmetric Dirichlet distributions. A document 
is composed of observed words, which we call tokens, to distinguish specific observations 
from the word (type) associated with each token. Because LDA assumes a document’s tokens are 
interchangeable, it treats the document as a bag-of-words, ignoring potential relations between words. 

10 For full details, see Hu et al. (To Appear). 

Constraints Change the Topics Discovered 

This problem with vanilla LDA can be solved by encoding constraints, which will ‘guide’ different 
words into the same topic. Constraints change the underlying distribution by forcing words to either 
be positively or negatively correlated with each other. If a user sees two words that should appear 
in the same topic but do not, they can impose a positive correlation between the words. If the user 
sees two words that appear in a topic together but should not, they can impose a negative correlation 
between the words. 

These correlations work by changing the underlying probabilistic model; while vanilla topic
models assume that each topic is a distribution over words, we use tree-structured topics
(Boyd-Graber et al., 2007; Andrzejewski et al., 2009). These models instead assume that topics
first have a distribution over concepts and these concepts in turn have a distribution over
words. By encoding word distributions as a tree, we can preserve conjugacy and relatively simple
inference while encouraging correlations between words that are grouped together in concepts.
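To make the tree-structured idea concrete, the following is a minimal sketch of how a word's probability under one topic could be read off a two-level tree (root to concept to word). It is an illustration under our own assumptions, not the cited authors' implementation: the function name, the dictionary layout, and the example probabilities are all hypothetical.

```python
def word_probability(word, root_dist, concept_dists, concept_of):
    """p(word | topic) under a two-level tree: root -> concept -> word.

    root_dist     : probabilities of the root's children (concepts and
                    unconstrained words) for one topic
    concept_dists : concept -> {word: probability within that concept}
    concept_of    : word -> concept, for constrained words only
    """
    concept = concept_of.get(word)
    if concept is None:
        # Unconstrained words hang directly off the root.
        return root_dist[word]
    # Constrained words share the probability mass of their concept.
    return root_dist[concept] * concept_dists[concept][word]


# Hypothetical topic in which "soviet" and "yeltsin" were placed in one concept.
root_dist = {"RUSSIA": 0.5, "tax": 0.5}
concept_dists = {"RUSSIA": {"soviet": 0.6, "yeltsin": 0.4}}
concept_of = {"soviet": "RUSSIA", "yeltsin": "RUSSIA"}
p_soviet = word_probability("soviet", root_dist, concept_dists, concept_of)
```

Because both constrained words draw on the same concept mass, a topic that raises the probability of the concept raises both words together, which is the positive correlation the constraint is meant to encourage.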

While these models can encourage words to be negatively or positively correlated, these con- 
straints on the model must be added interactively as the user sees problems that must be corrected. 

Interactively Adding Constraints 

Interactively changing constraints can be accommodated in ITM, smoothly transitioning from un- 
constrained LDA to constrained LDA with one constraint, to constrained LDA with two constraints, 
etc. 

A central tool that we use to transition between models is the strategic unassignment of states, 
which we call ablation (distinct from feature ablation in supervised learning). Gibbs sampling infer- 
ence stores the topic assignment of each token. In the implementation of a Gibbs sampler, unassign- 
ment is done by setting a token’s topic assignment to an invalid topic and decrementing any counts 
associated with that word. 
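The unassignment step can be sketched as follows. This is a minimal illustration of the bookkeeping only, assuming a sampler that stores counts in dictionaries; the class name, the sentinel value, and the term-level policy for choosing which tokens to unassign are our own assumptions rather than details given in the text.

```python
from collections import Counter

UNASSIGNED = -1  # sentinel for "no topic"; the value is our own choice


class GibbsState:
    """Minimal count bookkeeping for a collapsed Gibbs sampler (illustrative)."""

    def __init__(self, docs, assignments):
        self.docs = docs                # docs[d][i] is the word type of token i
        self.z = assignments            # z[d][i] is that token's current topic
        self.topic_word = Counter()     # (topic, word) -> count
        self.doc_topic = Counter()      # (doc, topic) -> count
        self.topic_total = Counter()    # topic -> count
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = self.z[d][i]
                self.topic_word[(k, w)] += 1
                self.doc_topic[(d, k)] += 1
                self.topic_total[k] += 1

    def unassign(self, d, i):
        """Ablate one token: decrement its counts and mark it unassigned."""
        k = self.z[d][i]
        if k == UNASSIGNED:
            return
        w = self.docs[d][i]
        self.topic_word[(k, w)] -= 1
        self.doc_topic[(d, k)] -= 1
        self.topic_total[k] -= 1
        self.z[d][i] = UNASSIGNED

    def ablate_terms(self, constraint_words):
        """Assumed policy: unassign every token whose word is in the constraint."""
        for d, doc in enumerate(self.docs):
            for i, w in enumerate(doc):
                if w in constraint_words:
                    self.unassign(d, i)


docs = [["soviet", "union", "tax"], ["tax", "budget"]]
state = GibbsState(docs, [[0, 1, 1], [1, 1]])
state.ablate_terms({"soviet", "union"})
```

After ablation, the sampler resumes from the remaining assignments and resamples the unassigned tokens under the new model, so only part of the state has to be relearned.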

The constraints created by users implicitly signal that words in constraints don’t belong in a 
given topic. In other models, this input is sometimes used to ‘fix,’ i.e., deterministically hold con- 
stant topic assignments (Ramage et al., 2009). Instead, we change the underlying model, using the 
current topic assignments as a starting position for a new Markov chain with some states strategi- 
cally unassigned; this is equivalent to performing online inference (Yao et al., 2009). 

An alternative would be to not pursue this interactive strategy but instead restart inference from
a new initialization. This, however, runs counter to the goals of pursuing topic modeling
interactively: restarting inference increases the time users must wait to see an updated model,
destroys any mental mapping users have formed of the model, and could introduce additional
problems into the model.

Merging Topics 

To examine the viability of ITM, we begin with a qualitative demonstration that shows the potential 
usefulness of ITM. For this task, we used a corpus of about 2000 New York Times editorials from 
the years 1987 to 1996. We started by finding 20 initial topics with no constraints, as shown in 
Table 12.1 (left). 

Notice that Topics 1 and 20 both deal with Russia. Topic 20 seems to be about the Soviet Union, 
with Topic 1 about the post-Soviet years. We wanted to combine the two into a single topic, so we 
created a constraint with all of the clearly Russian or Soviet words (boris, communist, gorbachev,
mikhail, russia, russian, soviet, union, yeltsin). Running inference forward 100 iterations with the
Doc ablation strategy yields the topics in Table 12.1 (right). The two Russia topics were combined 
into Topic 20. This combination also pulled in other relevant words that were not near the top of 
either topic before: “moscow” and “relations.” Topic 1 is now more about elections in countries 
other than Russia. The other 18 topics changed little. 

While we combined the Russian topics, other researchers analyzing large corpora might preserve 
the Soviet vs. post-Soviet distinction but combine topics about American government. ITM allows 
tuning for specific tasks. 


Left: before adding the constraint

Topic  Words
1      election, yeltsin, russian, political, party, democratic, russia, president,
       democracy, boris, country, south, years, month, government, vote, since,
       leader, presidential, military
2      new, york, city, state, mayor, budget, giuliani, council, cuomo, gov, plan,
       year, rudolph, dinkins, lead, need, governor, legislature, pataki, david
3      nuclear, arms, weapon, defense, treaty, missile, world, unite, yet, soviet,
       lead, secretary, would, control, korea, intelligence, test, nation, country,
       testing
4      president, bush, administration, clinton, american, force, reagan, war,
       unite, lead, economic, iraq, congress, america, iraqi, policy, aid,
       international, military, see
20     soviet, lead, gorbachev, union, west, mikhail, reform, change, europe,
       leaders, poland, communist, know, old, right, human, washington, western,
       bring, party

Right: after adding the constraint

Topic  Words
1      election, democratic, south, country, president, party, africa, lead, even,
       democracy, leader, presidential, week, politics, minister, percent, voter,
       last, month, years
2      new, york, city, state, mayor, budget, council, giuliani, gov, cuomo, year,
       rudolph, dinkins, legislature, plan, david, governor, pataki, need, cut
3      nuclear, arms, weapon, treaty, defense, war, missile, may, come, test,
       american, world, would, need, lead, get, join, yet, clinton, nation
4      president, administration, bush, clinton, war, unite, force, reagan,
       american, america, make, nation, military, iraq, iraqi, troops, international,
       country, yesterday, plan
20     soviet, union, economic, reform, yeltsin, russian, lead, russia, gorbachev,
       leaders, west, president, boris, moscow, europe, poland, mikhail, communist,
       power, relations

TABLE 12.1

Five topics from a 20-topic topic model on the editorials from the New York Times before adding
a constraint (left) and after (right). After the constraint was added, which encouraged Russian and
Soviet terms to be in the same topic, non-Russian terms gained increased prominence in Topic 1,
and “Moscow” (which was not part of the constraint) appeared in Topic 20.

However, user constraints are not absolute. For example, in experiments some users attempted
to merge topics about Apple computers and IBM-compatible personal computers discovered from
the 20 Newsgroups corpus.11 However, the model preferred to explain the data using two separate
topics.

Separating Topics

Another possible imperfection in a topic model is that a single topic conflates two concepts that
should be in distinct topics. This can be corrected by adding a constraint that two words cannot
appear in the same topic. For example, in a collection of biomedical publications, a topic might
be discovered that contains both words related to spinal cord and the urinary tract. Upon showing
this to a domain expert, an NIH program manager, it was found that this was incorrect clustering.
Introducing a constraint that these two words should not appear together results in the new topics in
Table 12.2.


12.4.2 Generalized Polya Urn Models 

A topic model claims that, given topic assignments, the observed words are selected i.i.d. from a
single set of topic distributions. If this assumption is true, then the expected number of documents
that contain any pair of words $w_i$, $w_j$ assigned to topic $k$ should be a function of
$p(w_i \mid k)$ and $p(w_j \mid k)$. Under this model, if those two probabilities are both large, it
is unlikely that there will be no documents containing both words. Several of the topic quality
metrics described in this chapter measure mismatch between the theoretical co-occurrence implied by
a model and actual word co-occurrence observed in documents.

11 See http://people.csail.mit.edu/jrennie/20Newsgroups/.

Before: bladder, spinal_cord, sci, spinal_cord_injury, spinal, urinary, urothelial, cervical,
        injury, recovery, urinary_tract, locomotor, lumbar, reflex

After (first topic): spinal_cord, spinal_cord_injury, spinal, injury, recovery, motor, reflex,
        urothelial, injured, functional_recovery, plasticity, locomotor, cervical, pathways

After (second topic): bladder, women, oc, pelvic_floor, incontinence, urinary_incontinence,
        pelvic, ui, prolapse, ul, contraceptive, treatment, stress, disorders

TABLE 12.2

Example of a topic being split using interactive topic modeling under the constraint that
“bladder” and “spinal_cord_injury” should not be in the same topic. This results in “bladder” now
being associated with incontinence.

The power that these simple metrics hold raises the question of why such topics should arise 
in the first place: if they are so easy to detect, why do they appear at all? The answer is that under 
standard specifications, topic models such as LDA cannot directly represent co-occurrence
information. Mimno et al. (2011) present an alternative model based on generalized Polya urns
(Mahmoud, 2008) that addresses this problem by encoding word co-occurrence information into the prior.

The generative process of a topic model is usually described in terms of discrete variables drawn 
from multinomial distributions that are themselves drawn from Dirichlet distributions. In this rep- 
resentation, the “meaning” of a topic is defined once and for all when the multinomial parameters 
for the topic-word distribution are sampled, and does not change no matter how many words are 
observed. An alternative generative model for LDA, which does not involve these intermediate 
multinomial parameters, is a standard Polya urn process. Under this representation, the “meaning” 
of a topic evolves as words are sampled. 

Consider an urn containing N balls, each with a single word written on it, such that $N_w$ balls
have word w written on them. If we draw and replace balls repeatedly, recording the word on
each sampled ball, the draws are i.i.d. and the word counts in the resulting set of words follow a
multinomial distribution with $p(w) \propto N_w$.

If instead of replacing just the sampled ball we also add a new ball with the same word, the 
resulting set of words is distributed as a Dirichlet-compound multinomial. The DCM distribution 
is equivalent to a Dirichlet-multinomial hierarchical model with the parameters of the multinomial 
distribution integrated out. This model, the standard Polya urn, is not i.i.d.: if we draw a ball with 
word w at time t, the probability that word w will appear on the next ball at t + 1 increases and the 
probability of all other words decreases. The model is, however, exchangeable, as the probability of 
a sequence of words is invariant to permutation of their order. 
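A short worked calculation may help to see why the standard urn is exchangeable but not i.i.d. The count $n_w$ (the number of earlier draws that showed word w) is notation introduced here for illustration only.

```latex
% Standard Polya urn: start with N balls, N_w of them labeled w.
% After n draws, n_w of which showed w, the next-draw probability is
\[
  p(w_{n+1} = w \mid w_1, \dots, w_n) = \frac{N_w + n_w}{N + n}.
\]
% Two-draw check of exchangeability for distinct words a and b:
\[
  p(a, b) = \frac{N_a}{N} \cdot \frac{N_b}{N + 1}
          = \frac{N_b}{N} \cdot \frac{N_a}{N + 1} = p(b, a),
\]
% whereas the draws are not independent:
\[
  p(a, a) = \frac{N_a}{N} \cdot \frac{N_a + 1}{N + 1}
  \neq \left( \frac{N_a}{N} \right)^{2}
  \quad \text{whenever } N_a < N .
\]
```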

The Polya urn process provides burstiness (a word, once seen, becomes more probable), but
it cannot represent covariance since an increase in one word decreases the probability of all other
words. The generalized Polya urn extends the standard urn model by specifying a separate rule for
adding new balls after sampling a ball of each type. For example, we might say that after sampling
a ball with word $w_2$, we should replace it along with two new balls with word $w_2$, and one each
of $w_5$, $w_8$, and $w_{15}$. In this way, $w_2$ would increase the probability of seeing $w_2$
again, but also increase the probability of the three other word types.

All three urn models can be represented by specifying a schema matrix $A$, which defines the
number of balls of each type to add after drawing a ball of each type. To define the simple
sampling-with-replacement model we use a matrix of all zeros, indicating that no new balls will be
added. For the standard Polya urn, we use an identity matrix, which specifies that after seeing a
ball of type w we add a single new ball of type w and nothing else. The generalized Polya urn
permits arbitrary values in the matrix (negative values are possible, corresponding to permanently
removing balls, but can lead to instability). Mimno et al. (2011) define a matrix with entries
proportional to the co-document matrix used in the previously discussed evaluation metrics:

$$A_{vv} \propto \lambda_v D(v), \qquad A_{vw} \propto \lambda_v D(w, v). \tag{12.9}$$

As with the standard Polya urn, the flexibility of the generalized Polya urn comes at the cost 
of additional complexity. Specifically, the resulting distribution is no longer exchangeable, as the 
probability of a sequence depends on the order that words are observed. Nevertheless, the model 
can be effectively trained using a Gibbs sampler as if the distribution were exchangeable. 
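The three urn schemes described in this section differ only in the schema matrix, which the following sketch makes explicit. It is a small simulation written for illustration, not the training procedure of Mimno et al.; the function name and the convention that `schema[w][v]` is the number of balls of type v added after drawing type w are our own assumptions.

```python
import random


def draw_from_urn(counts, schema, n_draws, rng):
    """Simulate an urn: draw a ball, return it, then add schema[w][v] balls
    of each type v, where w is the type just drawn."""
    counts = list(counts)              # work on a copy; counts[w] = balls of type w
    draws = []
    for _ in range(n_draws):
        w = rng.choices(range(len(counts)), weights=counts)[0]
        draws.append(w)
        for v, extra in enumerate(schema[w]):
            counts[v] += extra
    return draws, counts


rng = random.Random(0)

# Sampling with replacement: an all-zero schema never changes the urn.
_, plain = draw_from_urn([2, 1], [[0, 0], [0, 0]], 5, rng)

# Standard Polya urn: the identity schema adds one ball of the drawn type.
_, polya = draw_from_urn([2, 1], [[1, 0], [0, 1]], 5, rng)

# Generalized urn: drawing type 0 also adds balls of types 1 and 2.
first, generalized = draw_from_urn(
    [1, 0, 0], [[2, 1, 1], [0, 1, 0], [0, 0, 1]], 1, rng)
```

In the generalized case a single draw of type 0 makes types 1 and 2 available, which is how related words can reinforce one another.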

12.4.3 Regularized Topic Models 

Topic models have the potential to improve search and discovery by extracting useful semantic 
themes from text documents. When learned topics are coherent and interpretable, they can be valuable
for faceted browsing, result-set diversity, and document retrieval. However, when collections are
made up of short documents or noisy text (e.g., web search result snippets or blog posts), learned 
topics can be less coherent, less interpretable, and less useful. 

Predicated on recent evidence that a PMI-based topic coherence score is highly correlated 
with human-judged topic coherence (Newman et al., 2010a), Newman et al. (2011) proposed two 
Bayesian regularization formulations to improve topic coherence. Both methods use additional word 
co-occurrence data to improve the coherence and interpretability of learned topics, while still learn- 
ing a faithful representation of the collection of interest, as measured by likelihood of test data. 
These regularized topic models are an alternative to the generalized Polya urn models described in 
the previous section, and have similar objectives and goals. 

To learn more coherent topic models for small or noisy collections, they introduced structured
priors on $\phi_t$ based upon external data, which have a regularization effect on the standard LDA
model. More specifically, the priors on $\phi_t$ depend on the structural relations of the words in
the vocabulary as given by external data, which are characterized by the $W \times W$ “covariance”
matrix $C$. Intuitively, $C$ is a matrix that captures the short-range dependencies between (i.e.,
co-occurrences of) words in the external data. One is only interested in relatively frequent terms
from the vocabulary, so $C$ is a sparse matrix and computations are still feasible.

Quadratic Regularizer. A standard quadratic form is used with a trade-off factor. Given a matrix
of word dependencies $C$, use the prior:

$$p(\phi_t \mid C) \propto \left( \phi_t^{T} C \phi_t \right)^{\nu} \tag{12.10}$$

for some power $\nu$. The normalization factor is unknown but for MAP estimation we do not need it.
Optimizing the log posterior with respect to $\phi_{w|t}$ subject to the usual constraints, one
obtains the following fixed point update:

$$\phi_{w|t} \leftarrow \frac{1}{N_t + 2\nu} \left( N_{wt} + 2\nu \, \frac{\phi_{w|t} \sum_{i=1}^{W} C_{iw} \phi_{i|t}}{\phi_t^{T} C \phi_t} \right) \tag{12.11}$$


Unlike other topic models in which a covariance or correlation structure is used in the context of
correlated priors for $\theta_{t|d}$ (as in the correlated topic model (Blei and Lafferty, 2005)),
this method does not require the inversion of $C$, which would be impractical for even modest
vocabulary sizes. (Interactive topic modeling, discussed in Section 12.4.1, also adds correlations
without requiring this inversion because it preserves conjugacy.)
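As an illustration of why no matrix inversion is needed, one step of the fixed-point update in Equation (12.11) can be sketched with plain matrix-vector products. This is a sketch under our own assumptions (a dense list-of-lists matrix, a symmetric C, and hypothetical function and argument names), not the authors' code.

```python
def quadratic_regularizer_update(phi, counts, C, nu):
    """One fixed-point step of Equation (12.11) for a single topic t.

    phi    : current phi_{w|t}, a list of W probabilities
    counts : N_{wt}, tokens of word w currently assigned to topic t
    C      : W x W symmetric word-dependency matrix (list of lists)
    nu     : power of the quadratic prior (regularization strength)
    """
    W = len(phi)
    # c_phi[w] = sum_i C_{iw} phi_{i|t}
    c_phi = [sum(C[i][w] * phi[i] for i in range(W)) for w in range(W)]
    # quad = phi_t^T C phi_t
    quad = sum(phi[w] * c_phi[w] for w in range(W))
    n_t = sum(counts)
    return [(counts[w] + 2.0 * nu * phi[w] * c_phi[w] / quad) / (n_t + 2.0 * nu)
            for w in range(W)]


phi = [0.5, 0.3, 0.2]
C = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]     # words 0 and 1 are treated as related
updated = quadratic_regularizer_update(phi, [4, 1, 5], C, nu=2.0)
unregularized = quadratic_regularizer_update(phi, [4, 1, 5], C, nu=0.0)
```

The update keeps the distribution normalized by construction, and setting the power to zero recovers the plain relative frequencies of the counts.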

By using the update in Equation (12.11) we obtain the values for $\phi_{w|t}$. This means we no
longer have conjugate priors for $\phi_{w|t}$ and thus the standard Gibbs-sample update

$$p(z_{id} = t \mid x_{id} = w, \mathbf{z}^{\neg i}) \propto \frac{N_{wt}^{\neg i} + \beta}{N_{t}^{\neg i} + W\beta} \left( N_{td}^{\neg i} + \alpha \right) \tag{12.12}$$

does not hold. Instead, at the end of each major Gibbs cycle, $\phi_{w|t}$ is re-estimated and the
corresponding Gibbs update becomes:

$$p(z_{id} = t \mid x_{id} = w, \mathbf{z}^{\neg i}, \phi_{w|t}) \propto \phi_{w|t} \left( N_{td}^{\neg i} + \alpha \right). \tag{12.13}$$


Convolved Dirichlet Regularizer. Another approach to leveraging information on word dependencies
from external data is to consider that each $\phi_t$ is a mixture of word probabilities $\psi_t$,
where the coefficients are constrained by the word-pair dependency matrix $C$:

$$\phi_t \propto C \psi_t \quad \text{where} \quad \psi_t \sim \mathrm{Dirichlet}(\gamma \mathbf{1}). \tag{12.14}$$

Each topic has a different $\psi_t$ drawn from a Dirichlet, thus the model is a convolved Dirichlet.
This means that we convolve the supplied topic to include a spread of related words. Optimizing the
posterior and solving for $\psi_{w|t}$ one obtains:

$$\psi_{w|t} \propto \sum_{i=1}^{W} \frac{N_{it} C_{iw}}{\sum_{j=1}^{W} C_{ij} \psi_{j|t}} \, \psi_{w|t} + \gamma. \tag{12.15}$$

One follows the same semi-collapsed inference procedure used for the quadratic regularizer,
with the updates in Equations (12.15) and (12.14) producing the values for $\phi_{w|t}$ to be used
in the semi-collapsed sampler (12.13).
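The two steps named above, the update of Equation (12.15) followed by the mapping of Equation (12.14), can be sketched as below. The function name, the dense matrix representation, and the assumption that every row of C puts positive weight on some word with nonzero probability (so no denominator is zero) are ours.

```python
def convolved_dirichlet_update(psi, counts, C, gamma):
    """One step of Equation (12.15) for one topic, then Equation (12.14).

    psi    : current psi_{w|t}, a list of W probabilities
    counts : N_{wt}, tokens of word w currently assigned to topic t
    C      : W x W word-pair dependency matrix (list of lists)
    gamma  : Dirichlet parameter of the prior on psi_t
    Returns the updated psi_t and the resulting normalized phi_t.
    """
    W = len(psi)
    # mix[i] = sum_j C_{ij} psi_{j|t}
    mix = [sum(C[i][j] * psi[j] for j in range(W)) for i in range(W)]
    unnorm = [psi[w] * sum(counts[i] * C[i][w] / mix[i] for i in range(W)) + gamma
              for w in range(W)]
    total = sum(unnorm)
    psi_new = [u / total for u in unnorm]
    # phi_t is proportional to C psi_t
    phi_unnorm = [sum(C[w][j] * psi_new[j] for j in range(W)) for w in range(W)]
    z = sum(phi_unnorm)
    return psi_new, [p / z for p in phi_unnorm]


identity = [[1, 0], [0, 1]]
psi_new, phi_new = convolved_dirichlet_update([0.5, 0.5], [3, 1], identity, gamma=1.0)
```

With an identity matrix, where no word is related to any other, the update reduces to the familiar smoothed estimate (count plus gamma, normalized), which is a convenient sanity check.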

Using thirteen datasets from blog posts, news articles, and web searches, Newman et al. (2011)
show that both regularizers improve topic coherence and interpretability while learning a faithful
representation of the collection of interest. Additionally, in an experiment involving 3,650 crowd- 
sourced topic comparisons, they show that humans judge the regularized topic models as being more 
coherent than LDA. 

12.4.4 Automatic Topic Labeling 

In user-facing applications that use topic models, topics are displayed to humans, typically using 
the top-10 or so terms in the topic. However, it can sometimes be difficult for end-users to interpret 
the rich statistical information encoded in the topics, or to quickly get the gist of a topic. One way
of making topics more readily human interpretable is by annotating the topic with a short label. 
While this task is best done by a subject matter expert, recent work has shown that one can partially 
automate the generation of candidate labels for topics. 

Short labels for topics are typically best expressed with multiword terms (for example, STOCK
MARKET TRADING), or terms that might not be in the top-10 topic terms (for example, COLORS
would be a good label for a topic of the form red green blue cyan ...). Lau et al. (2011) proposed a 
novel method for automatic topic labeling that first generates topic label candidates using English 
Wikipedia, and then ranks the candidates to select the best topic labels. Given the size and diversity 
of English Wikipedia, they posit that the vast majority of (coherent) topics or concepts are probably 
well represented by a Wikipedia article title. 

Their method of predicting suitable candidate labels has two parts. They first have a system 
to generate a relatively long list of candidates. Then, they use lexical features of and association 
measures between candidate labels and topic terms in a support vector regression framework for 
ranking the labels. 

Generating the list of candidates starts with querying Wikipedia using the top-10 topic terms. 
The top-ranked search results (article titles) returned constitute the initial set of primary candidates 
for each topic. Next we chunk parse the primary candidates using the OpenNLP chunker and extract 
out all noun chunks. For each noun chunk, we generate all component n-grams, out of which we 
remove all n-grams which are not in themselves article titles in English Wikipedia. For example, 
if the Wikipedia document title were the single noun chunk United States Constitution, we would 
generate the bigrams United States and States Constitution, and prune the latter; we would also 
generate the unigrams United, States, and Constitution, all of which exist as Wikipedia articles and 
are preserved. 
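The n-gram expansion and pruning step can be sketched as follows. This is an illustration only: the function name is hypothetical, the set of titles is a toy stand-in for a lookup against English Wikipedia, and we assume (following the example above) that only n-grams shorter than the noun chunk itself are generated.

```python
def component_ngram_candidates(noun_chunk, wikipedia_titles):
    """Generate the component n-grams of a noun chunk and keep only those
    that are themselves article titles."""
    tokens = noun_chunk.split()
    kept = []
    for n in range(1, len(tokens)):                  # n-grams shorter than the chunk
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            if ngram in wikipedia_titles and ngram not in kept:
                kept.append(ngram)
    return kept


# Toy title set standing in for a real Wikipedia title lookup.
titles = {"United States Constitution", "United States",
          "United", "States", "Constitution"}
candidates = component_ngram_candidates("United States Constitution", titles)
```

Here the bigram "States Constitution" is generated but pruned because it is not in the title set, while "United States" and the three unigrams survive.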

Ranking candidate labels is premised on the idea that a good label should be strongly associ- 
ated with the topic terms. To learn the association of a label candidate with the topic terms, we use 
several lexical association measures: pointwise mutual information (PMI), Student’s t-test, Dice’s
coefficient, Pearson’s $\chi^2$ test, and the log-likelihood ratio. We also include conditional probability
and reverse conditional probability measures based on the work of Lau et al. (2010). To calculate
the association measures, we parse the full collection of English Wikipedia articles using a sliding
window of width 20, and obtain term frequencies for the label candidates and topic terms. To measure
the association between a label candidate and a list of topic terms, we average the scores of the
top-10 topic terms.
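As a concrete instance of one of these measures, the sketch below computes PMI from sliding-window counts and averages it over a topic's top terms. The function names, the count dictionaries, and the choice to return negative infinity for pairs that never co-occur are assumptions made for illustration; the toy counts are invented.

```python
import math


def pmi(count_xy, count_x, count_y, n_windows):
    """Pointwise mutual information estimated from sliding-window counts."""
    if count_xy == 0:
        return float("-inf")          # assumed convention for unseen pairs
    return math.log((count_xy * n_windows) / (count_x * count_y))


def label_association(label, topic_terms, window_counts, pair_counts, n_windows):
    """Average PMI between a label candidate and the top-10 topic terms."""
    top_terms = topic_terms[:10]
    scores = [pmi(pair_counts.get((label, term), 0),
                  window_counts[label], window_counts[term], n_windows)
              for term in top_terms]
    return sum(scores) / len(scores)


# Invented counts: number of windows containing each term or pair.
window_counts = {"immune system": 100, "cell": 50, "antigen": 25}
pair_counts = {("immune system", "cell"): 10, ("immune system", "antigen"): 10}
score = label_association("immune system", ["cell", "antigen"],
                          window_counts, pair_counts, n_windows=1000)
```

The other association measures listed above would slot into the same averaging step in place of `pmi`.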

These lexical features and association measures were used in a supervised model by training 
over topics where we have gold standard labeling of the label candidates using a support vector 
regression (SVR) model over all of the features. Table 12.3 shows examples of the top-ranked label 
candidate for four topics learned on four different corpora from diverse genres. We see that the 
top-ranked label candidate does a relatively good job of capturing the gist of each of the four topics. 

china chinese olympics gold olympic team win beijing medal sport ...
Label: 2008 SUMMER OLYMPICS

church arch wall building window gothic nave side vault tower ...
Label: GOTHIC ARCHITECTURE

israel peace barak israeli minister palestinian agreement prime leader ...
Label: ISRAELI-PALESTINIAN CONFLICT

cell response immune lymphocyte antigen cytokine t-cell induce receptor ...
Label: IMMUNE SYSTEM

TABLE 12.3

A sample of topics and automatically generated topic labels.





Handbook of Mixed Membership Models and Its Applications 


12.5 Conclusion 

While topic models are a popular technique for understanding large datasets, how to actually go 
from raw data to an effective topic analysis is often difficult for new users. This chapter discusses 
the iterative process for building topic models from preprocessing data to improving and under- 
standing the results users can obtain from models. In time, this process can benefit from continued 
development by both tool builders and researchers. 

Tool builders can make this process more straightforward by building unified interfaces that
seamlessly adjust tokenization, vocabulary, and topic models in one place. Improved visualization
tools that can help users identify and correct topic modeling errors would also make the process of
curating a topic model more straightforward.

Researchers can improve the process by building models that are less sensitive to the seemingly 
arbitrary choices made by users. Models should be less sensitive to the vocabulary, should be able 
to segment overly long documents, and should detect when the data fail to meet the assumptions of 
topic models, such as when a corpus is in multiple languages or dialects. Finally, researchers can 
improve inference throughput and latency so that users can try more models more quickly. 

Together, these advances will allow users to move from data to a final, quality model quickly 
and with minimal hassle.

Identifying latent groups of entities from observed interactions between pairs of entities is a
frequently encountered problem in areas like analysis of protein interactions and social networks. We
present a model that combines aspects of mixed membership stochastic blockmodels and topic mod- 
els to improve entity-entity link modeling by jointly modeling links and text about the entities that 
are linked. We apply the model to two datasets: a protein-protein interaction (PPI) dataset supple- 
mented with a corpus of abstracts of scientific publications annotated with the proteins in the PPI 
dataset and an Enron email corpus. The induced topics’ ability to help understand the nature of the 
data provides a qualitative evaluation of the model. Quantitative evaluation shows improvements 
in functional category prediction of proteins and in perplexity, using the joint model over baselines 
that use only link or text information. For the PPI dataset, the topic coherence of the emergent topics 
and the ability of the model to retrieve relevant scientific articles and proteins related to the topic 
are compared to those of a text-only approach that does not make use of the protein-protein
interaction matrix. Evaluation of the results by biologists shows that the joint modeling results in better
topic coherence and improves retrieval performance in the task of identifying top related papers and 
proteins. 




13.1 Introduction 

The task of modeling latent groups of entities from observed interactions is a commonly
encountered problem. In social networks, for instance, we might want to identify sub-communities. In the
biological domain we might want to discover latent groups of proteins based on observed pairwise 
interactions. Mixed membership stochastic blockmodels (MMSB) (Airoldi et al., 2008; Parkkinen 
et al., 2009) approach this problem by assuming that nodes in a graph represent entities belonging to 
latent blocks with mixed membership, effectively capturing the notion that entities may arise from 
different sources and have different roles. 

In another area of active research, models like latent Dirichlet allocation (LDA) (Blei et al., 
2003) model text documents in a corpus as arising from mixtures of latent topics. In such mod- 
els, words in a document are potentially generated from different topics using topic-specific word 
distributions. Extensions to LDA (Erosheva et al., 2004; Griffiths and Steyvers, 2004) additionally 
model other metadata in documents such as authors and entities by treating a latent topic as a set of 
distributions, one for each metadata type. For instance, when modeling scientific publications from 
the biological domain, a latent topic could have a word distribution, an author distribution, and a 
protein entity distribution. We refer to this model as Link LDA following the convention established 
by Nallapati et al. (2008). The different types of data that are contained in a document (e.g., words 
in the body, words in the title, authors, list of citations, etc.) are referred to as entity types. 

In this chapter, we present a model, Block-LDA, that jointly generates text documents annotated 
with metadata about associated entities and external links between pairs of entities. This allows 
the model to use supplementary annotated text to influence and improve link modeling. The text 
documents are modeled as bags of entities of different types and the network is modeled as edges 
between entities of a source type to a destination type. Consider the example of a corpus of pub- 
lications about the yeast organism and a network of protein-protein interactions in yeast. These 
publications are further annotated by experts with lists of proteins that are discussed in them.
Therefore, each publication could be modeled as a collection of bags, viz., a bag of body words, a
bag of authors, a bag of proteins discussed in the paper, etc. Similarly, the network could be a collection of
protein-protein interactions independently observed. The model merges the idea of latent topics in 
topic models with blocks in stochastic blockmodels. The joint modeling permits sharing of informa- 
tion about the latent topics between the network structure and text, resulting in more coherent topics. 
Co-occurrence patterns in entities and words related to them aid the modeling of links in the graph. 
Likewise, entity-entity links provide clues about topics in the text. We also propose a method to 
perform approximate inference in the model using a collapsed Gibbs sampler, since exact inference 
in the joint model is intractable. 

We then use the model to organize a large collection of literature about yeast biology to enable
topic-oriented browsing and retrieval from the literature. The analysis uses mixed membership
topic modeling to uncover latent structure in document corpora by identifying the broad
topics discussed in them. This approach complements traditional information retrieval tasks
where the objective is to fulfill very specific information needs. By using joint modeling, we are
able to use other sources of domain information in addition to the literature. In
the case of yeast biology, an example of such a resource is a database of known protein-protein
interactions (PPI) which have been identified using wet-lab experiments. We perform data fusion
by combining text information from articles and the database of yeast protein-protein interactions
by using a latent variable model, Block-LDA (Balasubramanyan and Cohen, 2011), that jointly
models the literature and PPI networks.

We evaluate the ability of the topic models to return meaningful topics by inspecting the top
papers and proteins that pertain to them. We compare the performance of the joint model, i.e.,
Block-LDA, with a model that only considers the text corpora by asking a yeast biologist to
evaluate the coherence of topics and the relevance of the retrieved articles and proteins. This
evaluation serves to test the utility of Block-LDA on a real task as opposed to an internal evaluation
(such as by using perplexity metrics). Our evaluation shows that the joint model outperforms the
text-only approach both in topic coherence and in top paper and protein retrieval as measured by
precision@10 values.
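For readers unfamiliar with the metric, precision at 10 is the fraction of the ten highest-ranked items that an evaluator judges relevant. A minimal sketch follows; the function name and the toy identifiers are hypothetical, and dividing by k even when fewer than k items are returned is a convention we assume here.

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k ranked items that are marked relevant."""
    top = ranked_items[:k]
    return sum(1 for item in top if item in relevant) / k


# Toy ranking of ten retrieved papers, two of which were judged relevant.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
p_at_10 = precision_at_k(ranking, relevant={"p1", "p3", "p42"})
```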

The chapter is organized as follows: Section 13.2 introduces the model and presents a Gibbs
sampling-based method for performing approximate inference with the model. Section 13.3 dis- 
cusses related work, and Section 13.4 provides details of datasets used in the experiments. Sections 
13.5.1 and 13.5.2 present the results of our experiments on two datasets from different domains. 
Finally, our conclusions are in Section 13.6. 


13.2 Block-LDA 

Variables in the model

$K$                 the number of topics (therefore resulting in $K^2$ blocks in the network)
$\alpha_L$          Dirichlet prior for the topic pair distribution for links
$\alpha_D$          Dirichlet prior for document specific topic distributions
$\gamma$            Dirichlet prior for topic multinomials
$\pi_L$             multinomial distribution over topic pairs for links
$\theta_d$          multinomial distribution over topics for document d
$T$                 the number of types of entities in the corpus
$\beta_{t,z}$       multinomial over entities of type t for topic z
$D$                 number of documents in the corpus
$z_{t,i}$           topic chosen for the i-th entity of type t in a document
$e_{t,i}$           the i-th entity of type t occurring in a document
$N_L$               number of links in the network
$z_{i1}, z_{i2}$    topics chosen for the two nodes participating in the i-th link
$e_{i1}, e_{i2}$    the two nodes participating in the i-th link


The Block-LDA model (plate diagram in Figure 13.1) enables sharing of information between the 
component on the left that models links between pairs of entities represented as edges in a graph 
with a block structure, and the component on the right that models text documents through shared 
latent topics. More specifically, the distribution over the entities of the type that are linked is shared 
between the blockmodel and the text model. 

The component on the right, which is an extension of LDA, models documents as sets of
“bags of entities,” with each bag corresponding to a particular type of entity. Every entity type has a 
topic-wise multinomial distribution over the set of entities that can occur as an instance of the entity 
type. 

The component on the left is a generative model for graphs representing entity-entity links with 
an underlying block structure, derived from the sparse blockmodel introduced by Parkkinen et al. 
(2009). Linked entities are generated from topic-specific entity distributions conditioned on the topic 
pairs sampled for the edges. Topic pairs for edges (links) are drawn from a multinomial defined over 
the Cartesian product of the topic set with itself. Vertices in the graph representing entities therefore 
have mixed memberships in topics. In contrast to MMSB, only observed links are sampled, making 
this model suitable for sparse graphs. 

FIGURE 13.1

Block-LDA: plate diagram.

Let $K$ be the number of latent topics (blocks) we wish to recover. Assuming documents consist
of $T$ different types of entities (i.e., each document contains $T$ bags of entities), and that links in the
graph are between entities of type $t_l$ and $t_r$, the generative process is as follows.

1. Generate topics:

• For each type $t \in \{1, \ldots, T\}$ and topic $z \in \{1, \ldots, K\}$, sample
$\beta_{t,z} \sim \mathrm{Dirichlet}(\gamma)$, the topic specific entity distribution.

2. Generate documents. For every document $d \in \{1, \ldots, D\}$:

• Sample $\theta_d \sim \mathrm{Dirichlet}(\alpha_D)$, where $\theta_d$ is the topic mixing distribution for the document.

• For each type $t$ and its associated set of entity mentions $e_{t,i}$, $i \in \{1, \ldots, N_{d,t}\}$:

- Sample a topic $z_{t,i} \sim \mathrm{Multinomial}(\theta_d)$.

- Sample an entity $e_{t,i} \sim \mathrm{Multinomial}(\beta_{t, z_{t,i}})$.

3. Generate the link matrix of entities of type $t_l$:

• Sample $\pi_L \sim \mathrm{Dirichlet}(\alpha_L)$, where $\pi_L$ describes a distribution over the Cartesian product
of topics for links in the dataset.

• For every link $e_{i1} \rightarrow e_{i2}$, $i \in \{1, \ldots, N_L\}$:

- Sample a topic pair $\langle z_{i1}, z_{i2} \rangle \sim \mathrm{Multinomial}(\pi_L)$.

- Sample $e_{i1} \sim \mathrm{Multinomial}(\beta_{t_l, z_{i1}})$.

- Sample $e_{i2} \sim \mathrm{Multinomial}(\beta_{t_r, z_{i2}})$.

Note that unlike the MMSB model introduced by Airoldi et al. (2008), this model generates only 
realized links between entities. 
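
The three steps of the generative process can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name, its arguments, the fixed number of mentions per document, and the default hyperparameter values are all hypothetical and are not taken from the chapter.

```python
import numpy as np

def generate_block_lda(K, entity_counts, D, mentions_per_doc, n_links,
                       link_types=(0, 0), alpha_d=0.5, alpha_l=0.5,
                       gamma=0.5, seed=0):
    """Sample a toy corpus and link list from the generative process above.

    entity_counts[t] is |E_t|; mentions_per_doc[t] plays the role of N_{d,t}
    and is held fixed across documents purely for simplicity (an assumption).
    """
    rng = np.random.default_rng(seed)
    T = len(entity_counts)
    t_l, t_r = link_types

    # Step 1: beta[t] is a K x |E_t| matrix; row z is drawn from Dirichlet(gamma).
    beta = [rng.dirichlet(np.full(n_e, gamma), size=K) for n_e in entity_counts]

    # Step 2: each document is a list of T bags of entity indices.
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha_d))
        bags = []
        for t in range(T):
            topics = rng.choice(K, size=mentions_per_doc[t], p=theta)
            bags.append([int(rng.choice(entity_counts[t], p=beta[t][z]))
                         for z in topics])
        docs.append(bags)

    # Step 3: pi_L is a distribution over the K * K topic pairs.
    pi_l = rng.dirichlet(np.full(K * K, alpha_l))
    links = []
    for pair in rng.choice(K * K, size=n_links, p=pi_l):
        z1, z2 = divmod(int(pair), K)
        e1 = int(rng.choice(entity_counts[t_l], p=beta[t_l][z1]))
        e2 = int(rng.choice(entity_counts[t_r], p=beta[t_r][z2]))
        links.append((e1, e2))
    return docs, links, beta, pi_l.reshape(K, K)


# Toy run: two entity types (say, 50 proteins and 10 authors), links between proteins.
docs, links, beta, pi_l = generate_block_lda(
    K=3, entity_counts=[50, 10], D=4, mentions_per_doc=[20, 3], n_links=30)
```

Only realized links are drawn in step 3, which is what keeps the procedure cheap on sparse graphs: no variable is sampled for absent edges.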
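Given the hyperparameters $\alpha_D$, $\alpha_L$, and $\gamma$, the joint distribution over the documents, links, their 
topic distributions, and topic assignments is given by 

\[
\begin{aligned}
p(\pi_L, \theta, \beta, z, e, (z_1, z_2), (e_1, e_2) \mid \alpha_D, \alpha_L, \gamma) \propto{}
 & \prod_{z=1}^{K} \prod_{t=1}^{T} \mathrm{Dir}(\beta_{t,z} \mid \gamma) \; \times \\
 & \prod_{d=1}^{D} \mathrm{Dir}(\theta_d \mid \alpha_D) \prod_{t=1}^{T} \prod_{i=1}^{N_{d,t}}
   \theta_d^{(z_{t,i}^{(d)})} \, \beta_{t, z_{t,i}^{(d)}}^{(e_{t,i})} \; \times \\
 & \mathrm{Dir}(\pi_L \mid \alpha_L) \prod_{i=1}^{N_L}
   \pi_L^{(z_{i1}, z_{i2})} \, \beta_{t_l, z_{i1}}^{(e_{i1})} \, \beta_{t_r, z_{i2}}^{(e_{i2})}
\end{aligned}
\tag{13.1}
\]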


A commonly required operation when using models like Block-LDA is to perform inference on 
the model to query the topic distributions and the topic assignments of documents and links. Due to 
the intractability of exact inference in the Block-LDA model, a collapsed Gibbs sampler is used to 
perform approximate inference. It samples a latent topic for an entity mention of type t in the text 
corpus conditioned on the assignments to all other entity mentions using the following expression 
(after collapsing $\theta_d$): 


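\[
p(z_{t,i} = z \mid e_{t,i}, z^{\neg i}, e^{\neg i}, \alpha_D, \gamma) \propto
\left( n_{dz}^{\neg i} + \alpha_D \right)
\frac{n_{z t e_{t,i}}^{\neg i} + \gamma}{\sum_{e'} n_{z t e'}^{\neg i} + |E_t| \gamma}
\tag{13.2}
\]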


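Similarly, we sample a topic pair for every link conditional on topic pair assignments to all other 
links after collapsing $\pi_L$, using the expression: 

\[
\begin{aligned}
p(z_i = (z_1, z_2) \mid (e_{i1}, e_{i2}), z^{\neg i}, (e_1, e_2)^{\neg i}, \alpha_L, \gamma) \propto{}
 & \left( n_{(z_1, z_2)}^{L \neg i} + \alpha_L \right) \times \\
 & \frac{\left( n_{z_1 t_l e_{i1}}^{\neg i} + \gamma \right) \left( n_{z_2 t_r e_{i2}}^{\neg i} + \gamma \right)}
        {\left( \sum_{e} n_{z_1 t_l e}^{\neg i} + |E_{t_l}| \gamma \right)
         \left( \sum_{e} n_{z_2 t_r e}^{\neg i} + |E_{t_r}| \gamma \right)}
\end{aligned}
\tag{13.3}
\]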

$E_t$ refers to the set of all entities of type $t$. The $n$'s refer to the number of topic assignments in the 
data. 

• $n_{zte}$: the number of times an entity $e$ of type $t$ is observed under topic $z$. 

• $n_{dz}$: the number of entities (of any type) with topic $z$ in document $d$. 

• $n^{L}_{(z_1, z_2)}$: count of links assigned to topic pair $(z_1, z_2)$. 
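
As a rough sketch of how the two sampling expressions translate into code, the functions below compute the normalized conditional distributions from count arrays. The function names and array layouts are assumptions made for this illustration; the counts passed in are expected to already exclude the mention or link being resampled.

```python
import numpy as np

def mention_topic_probs(n_dz, n_zte, e, alpha_d, gamma):
    """Normalized form of Eq. (13.2) for one mention of entity e.

    n_dz:  length-K counts of topic assignments in the mention's document.
    n_zte: K x |E_t| counts of entities of the mention's type under each topic.
    """
    n_entities = n_zte.shape[1]
    weights = (n_dz + alpha_d) * (n_zte[:, e] + gamma) / (
        n_zte.sum(axis=1) + n_entities * gamma)
    return weights / weights.sum()

def link_topic_pair_probs(n_link, n_left, n_right, e1, e2, alpha_l, gamma):
    """Normalized form of Eq. (13.3) for one link e1 -> e2.

    n_link:  K x K counts of links per topic pair.
    n_left:  K x |E_{t_l}| entity-topic counts for the source entity type.
    n_right: K x |E_{t_r}| entity-topic counts for the target entity type.
    """
    left = (n_left[:, e1] + gamma) / (
        n_left.sum(axis=1) + n_left.shape[1] * gamma)
    right = (n_right[:, e2] + gamma) / (
        n_right.sum(axis=1) + n_right.shape[1] * gamma)
    weights = (n_link + alpha_l) * np.outer(left, right)
    return weights / weights.sum()


# Toy counts with K = 2 topics and two entities.
counts = np.array([[3.0, 1.0], [0.0, 0.0]])
p_mention = mention_topic_probs(np.array([2.0, 0.0]), counts, e=0,
                                alpha_d=1.0, gamma=1.0)
p_link = link_topic_pair_probs(np.zeros((2, 2)), counts, counts,
                               e1=0, e2=1, alpha_l=1.0, gamma=1.0)
```

A sampler would draw the new assignment from these probabilities (for example with `numpy.random.Generator.choice`) and then add the mention or link back into the counts.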

The topic multinomial parameters and the topic distributions of links and documents are easily 
recovered using their MAP estimates after inference using the counts of observations: 


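\[
\hat{\beta}_{t,z}^{(e)} = \frac{n_{zte} + \gamma}{\sum_{e'} n_{zte'} + |E_t| \gamma}
\tag{13.4}
\]

\[
\hat{\theta}_d^{(z)} = \frac{n_{dz} + \alpha_D}{\sum_{z'} n_{dz'} + K \alpha_D}
\tag{13.5}
\]

\[
\hat{\pi}_L^{(p,q)} = \frac{n^{L}_{(p,q)} + \alpha_L}{\sum_{p', q'} n^{L}_{(p',q')} + K^2 \alpha_L}
\tag{13.6}
\]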
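
A minimal sketch of these estimates, assuming all three follow the same smoothed-count form as Equation (13.4): add the symmetric prior to each count and normalize. The helper name and the toy counts are hypothetical.

```python
import numpy as np

def smoothed_estimate(counts, prior):
    """Add a symmetric Dirichlet prior to counts and normalize the last axis."""
    counts = np.asarray(counts, dtype=float)
    return (counts + prior) / (
        counts.sum(axis=-1, keepdims=True) + counts.shape[-1] * prior)

# beta_hat for one entity type: rows are topics, columns are entities.
beta_hat = smoothed_estimate([[3, 1], [0, 0]], prior=1.0)

# theta_hat: rows are documents, columns are topics.
theta_hat = smoothed_estimate([[2, 0], [1, 1]], prior=0.5)

# pi_hat: normalize over all K * K topic pairs at once, then restore the shape.
n_link = np.array([[4, 0], [1, 3]])
pi_hat = smoothed_estimate(n_link.ravel(), prior=0.1).reshape(n_link.shape)
```

A topic with no observations (the second row of `beta_hat` here) falls back to the uniform distribution, which is the effect of the prior.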


A de-noised form of the entity-entity link matrix can also be recovered from the estimated 
parameters of the model. Let $B_t$ be a matrix of dimensions $K \times |E_t|$ where row $k$ is $\beta_{t,k}$, $k \in \{1, \ldots, K\}$. 
Let $Z$ be a matrix of dimensions $K \times K$ such that $Z_{p,q} = \sum_{i=1}^{N_L} \mathbb{1}(z_{i1} = p, z_{i2} = q)$. The 
de-noised matrix $M$ of the strength of association between the entities in $E_{t_l}$ is given by 
$M = B_{t_l}^{T} Z B_{t_r}$. 
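
The de-noised matrix is a pair of matrix products. The sketch below assumes that Z counts the links assigned to each topic pair and that the B matrices hold the estimated topic-wise entity distributions as rows; the function name and toy values are illustrative only.

```python
import numpy as np

def denoised_link_matrix(b_left, b_right, link_topic_pairs):
    """Return M = B_left^T Z B_right.

    b_left:  K x |E_{t_l}| matrix whose rows are topic-wise entity distributions.
    b_right: K x |E_{t_r}| matrix of the same kind for the target entity type.
    link_topic_pairs: iterable of (z1, z2) assignments, one per observed link.
    """
    K = b_left.shape[0]
    Z = np.zeros((K, K))
    for z1, z2 in link_topic_pairs:
        Z[z1, z2] += 1
    return b_left.T @ Z @ b_right


B = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8]])
M = denoised_link_matrix(B, B, [(0, 0), (0, 0), (1, 1), (0, 1)])
```

Because each row of B sums to one, the entries of M sum to the number of links, so M can be read as the observed link mass redistributed according to the block structure.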


13.3 Related Work 

Link LDA, and many other extensions to LDA, model documents that are annotated with metadata. 
In a parallel area of research, various approaches to modeling links between documents 
have been explored. For instance, pairwise-link-LDA (Nallapati et al., 2008) combines MMSB with 
LDA by modeling documents using LDA and generating links between them using MMSB. The 
relational topic model (Chang and Blei, 2009) generates links between documents based on their 
topic distributions. The copycat and citation influence models (Dietz et al., 2007) also model links 
between citing and cited documents by extending LDA and eliminating independence between 
documents. The latent topic hypertext model (LTHM) (Gruber et al., 2008) presents a generative process 
for documents that can be linked to each other from specific words in the citing document. These 
classes of models differ from the model proposed in this paper, Block-LDA, in that Block-LDA 
models links between entities in the documents rather than links between documents. 

The Nubbi model (Chang et al., 2009) tackles a related problem where entity relations are 
discovered from text data by relying on words that appear in the context of entities and entity pairs 
in the text. Block-LDA differs from Nubbi in that it models a document as bags of entities without 
considering the location of entity mentions in the text. The entities need not even be mentioned in 
the text of the document. The group-topic model (Wang et al., 2006) addresses the task of modeling 
events pertaining to pairs of entities with textual attributes that annotate the event. The text in this 
model is associated with events, which differs from the standalone documents mentioning entities 
considered by Block-LDA. 

The author-topic model (AT) (Rosen-Zvi et al., 2004) addresses the task of modeling corpora 
annotated with the IDs of people who authored the documents. Every author in the corpus has a 
topic distribution over the latent topics, and words in the documents are drawn from topics drawn 
from the specific distribution of the author who is deemed to have generated the word. The 
author-recipient-topic model (ART) (McCallum et al., 2005) extends the idea further by building a topic 
distribution for every author-recipient pair. As we show in the experiments below, Block-LDA can 
also be used to model the relationships between authors, recipients, and words in documents by 
constructing an appropriate link matrix from known information about the authors and recipients 
of documents; however, unlike the AT and ART models which are primarily designed to model 
documents, Block-LDA provides a generative model for the links between authors and recipients 
in addition to documents. This allows Block-LDA to be used for additional inferences not possible 
with the AT or ART models, for instance, predicting probable author-recipient interactions. Wen and 
Lin (2010) describe an application of an approach that uses both content and network information 
to analyze enterprise data. Although the network and content are not modeled jointly, LDA is 
used to study the topics in communications between people. 

A summary of related models from prior work is shown in Table 13.1. 

The Munich Information Center for Protein Sequences (MIPS) database (Mewes et al., 2004) includes a 
hand-crafted collection of protein interactions covering 8000 protein complex associations in yeast. 
We use a subset of this collection containing 844 proteins, for which all interactions were hand- 
curated (Figure 13.2(a)). The MIPS institute also provides a set of functional annotations for each 
protein which are organized in a tree, with 15 nodes at the first level (shown in Table 13.2). The 844 
proteins participating in interactions are mapped to these 15 functional categories with an average 
of 2.5 annotations per protein. 

Model                                 Links               Documents 
------------------------------------  ------------------  --------------------------------------- 
LDA                                   -                   words 
link LDA                              -                   words + entities 
relational topic model                document-document   words + document IDs 
pairwise-link-LDA, link-PLSA-LDA      document-document   words + cited document IDs 
copycat, citation influence models    document-document   words + cited document IDs 
latent topic hypertext model          document-document   words + cited document IDs 
author-recipient-topic model          -                   docs + authors + recipients 
author-topic model                    -                   docs + authors 
topic link LDA                        document-document   words + authors 
MMSB                                  entity-entity       - 
sparse blockmodel (Parkkinen et al.)  entity-entity       - 
Nubbi                                 entity-entity       words near entities or entity-pairs 
group topic model                     entity-entity       words about the entity-entity event 
Block-LDA                             entity-entity       words + entities 

TABLE 13.1 

Related work. 

We also use another dataset of protein-protein interactions in yeast that were observed as a result 
of wetlab experiments by collaborators. This dataset consists of 635 interactions that deal primarily 
with ribosomal proteins and assembly factors in yeast. 

In addition to the MIPS PPI data, we use a text corpus that is derived from the repository of 
scientific publications at PubMed®. PubMed is a free, open-access, on-line archive of over 18 million 
biological abstracts, bibliographies, and citation lists for papers published since 1948 (U.S. National 
Library of Medicine, 2008). The subset we work with consists of approximately 40,000 publications 
about the yeast organism that have been curated in the Saccharomyces Genome Database (SGD) 
(Dwight et al., 2004) with annotations of proteins that are discussed in the publication. We further 
restrict the dataset to only those documents that are annotated with at least one protein from the 
MIPS database. This results in a MIPS-protein annotated document collection of 15,776 
publications. The publications in this set were written by a total of 47,215 authors. We tokenize the titles 
and abstracts based on white space, lowercase all tokens, and eliminate stopwords. Low frequency 
(<5 occurrences) terms are also eliminated. The vocabulary contains 45,648 words. 

Metabolism 
Cellular communication/signal transduction mechanism 
Cell rescue, defense and virulence 
Regulation of / interaction with cellular environment 
Cell fate 
Energy 
Control of cellular organization 
Cell cycle and DNA processing 
Subcellular localisation 
Transcription 
Protein synthesis 
Protein activity regulation 
Transport facilitation 
Protein fate (folding, modification, destination) 
Cellular transport and transport mechanisms 

TABLE 13.2 

List of functional categories. 

FIGURE 13.2 

Observed protein-protein interactions compared to thresholded co-occurrence in text: (a) MIPS 
interactions; (b) co-occurrences in text. 


13.4 Datasets 

To investigate the co-occurrence patterns of proteins annotated in the abstracts, we construct a co- 
occurrence matrix. From every abstract, a link is constructed for every pair of annotated protein 
mentions. Additionally, protein mentions that occur fewer than 5 times in the corpus are discarded. 
Figure 13.2(b) shows that the resultant matrix looks very similar to the MIPS PPI matrix in Figure 
13.2(a). This suggests that joint modeling of the protein-annotated text with the PPI information 
has the potential to be beneficial. The nodes representing proteins in Figures 13.2(a) and 13.2(b) are 
ordered by their cluster IDs, obtained by clustering them using k-means clustering, treating proteins 
as 15-bit vectors of functional category annotations. 
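
The co-occurrence construction described above can be sketched with the standard library alone. The input format (one list of protein IDs per abstract), the function name, and the choice to count each protein once per abstract are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_links(doc_proteins, min_mentions=5):
    """Count protein pairs that are annotated on the same abstract.

    doc_proteins: one list of protein IDs per abstract (hypothetical format).
    Proteins annotated on fewer than min_mentions abstracts are discarded.
    """
    mentions = Counter(p for proteins in doc_proteins for p in set(proteins))
    keep = {p for p, n in mentions.items() if n >= min_mentions}
    links = Counter()
    for proteins in doc_proteins:
        kept = sorted(set(proteins) & keep)
        # One link per unordered pair of retained proteins in this abstract.
        links.update(combinations(kept, 2))
    return links


toy_docs = [["rps3", "rpl5"], ["rps3", "rpl5", "rpl32"], ["rps3"]]
toy_links = cooccurrence_links(toy_docs, min_mentions=2)
```

In the toy input, `rpl32` is annotated only once and is dropped, so the single surviving pair is (`rpl5`, `rps3`) with a count of 2.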

The Enron email corpus (Shetty and Adibi, 2004) is a large publicly available collection of email 
messages subpoenaed as part of the investigation by the Federal Energy Regulatory Commission 
(FERC). The dataset contains 517,437 messages in total. Although the Enron Email Dataset contains 
the email folders of 150 people, two people appear twice with different usernames, and one user’s 
emails consist solely of automated emails, resulting in 147 unique people in the dataset. For the text 
component of the model, we use all the emails in the Sent¹ folders of the 147 users’ mailboxes, 
resulting in a corpus of 96,103 emails. Messages are annotated with mentions of people from the set 
of 147 Enron employees if they are senders or recipients of the email. Mentions of people outside of 
the 147 persons considered are dropped. While extracting text from the email messages, “quoted” 
messages are eliminated using a heuristic which looks for a “Forwarded message” or “Original 
message” delimiter. In addition, lines starting with a “>” are also eliminated. The emails are then 
tokenized after lowercasing the entire message, using whitespace and punctuation marks as word 
delimiters. Words occurring fewer than 5 times in the corpus are discarded. The vocabulary of the 
corpus consists of 32,880 words. 

¹ The “Sent”, “sent_items”, and “_sent_mail” folders in users’ mailboxes were treated as “Sent” folders. 
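
A minimal sketch of this preprocessing is shown below. The exact heuristic used by the authors is not specified, so two choices here are assumptions: everything from the first delimiter line onward is treated as quoted text, and a token is any run of letters or digits. All function names are hypothetical.

```python
import re
from collections import Counter

# Delimiters that mark the start of quoted material (matched case-insensitively).
QUOTE_DELIMITER = re.compile(r"forwarded message|original message", re.IGNORECASE)
# A token is a run of letters or digits; whitespace and punctuation separate tokens.
TOKEN = re.compile(r"[^\W_]+")

def clean_email_body(body):
    """Drop quoted material, then lowercase and tokenize what remains."""
    kept = []
    for line in body.splitlines():
        if QUOTE_DELIMITER.search(line):
            break  # assumption: the rest of the message is quoted text
        if line.startswith(">"):
            continue
        kept.append(line)
    return TOKEN.findall("\n".join(kept).lower())

def prune_rare(token_lists, min_count=5):
    """Remove words that occur fewer than min_count times in the corpus."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in token_lists]


body = "Please review.\n> old line\nThanks, Bob\n-----Original Message-----\nFrom: x"
tokens = clean_email_body(body)
```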

For the entity links component of the model, we build an email communication network by 
constructing a link between the sender and every recipient of an email message for every email in 
the corpus. Recipients of the emails include people directly addressed in the “To” field and peo- 
ple included in the “Cc” and “Bcc” fields. Similar to the text component, only links between the 
147 Enron employees are considered. The link dataset generated in this manner has 200,404 links. 
Figure 13.3(a) shows the email network structure. The nodes in the matrix representing people are 
ordered by cluster IDs obtained by running k-means clustering on the 147 people. Each person s is 
represented by a vector of length 147, where the elements in the vector are normalized counts of the 
number of times an email is sent by s to the person indicated by the element. 
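
The link construction and the per-person vectors used for ordering can be sketched as follows. The email record layout (a dict with `from`, `to`, `cc`, and `bcc` keys) is hypothetical, and normalizing each vector to sum to one is an assumption, since the text does not say which normalization was used.

```python
from collections import Counter

def build_links(emails, people):
    """Create one (sender, recipient) link per recipient in To, Cc, or Bcc."""
    links = []
    for email in emails:
        sender = email["from"]
        if sender not in people:
            continue
        for field in ("to", "cc", "bcc"):
            links.extend((sender, r) for r in email.get(field, []) if r in people)
    return links

def sender_profiles(links, people):
    """Per-sender vector of recipient counts, normalized to sum to 1."""
    counts = Counter(links)
    profiles = {}
    for s in people:
        row = [counts[(s, r)] for r in people]
        total = sum(row)
        profiles[s] = [c / total for c in row] if total else row
    return profiles


people = ["a", "b", "c"]
emails = [
    {"from": "a", "to": ["b", "x"], "cc": ["c"]},
    {"from": "a", "to": ["b"]},
    {"from": "x", "to": ["a"]},  # sender outside the tracked set: ignored
]
links = build_links(emails, people)
profiles = sender_profiles(links, people)
```

The resulting vectors are what a clustering routine such as k-means would take as input to order the rows and columns of the network matrix.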
FIGURE 13.3 

Enron network and its de-noised recovered versions: (a) observed network; (b) from sparse 
blockmodel; (c) from Block-LDA. 


13.5 Experimental Results 

We present results from experiments using Block-LDA to model the yeast and Enron datasets de- 
scribed in Section 13.4. 

13.5.1 Results from the Yeast Dataset 
Perplexity and Convergence 

First, we investigate the convergence properties of the Gibbs sampler used for inference in 
Block-LDA by observing link perplexity on held-out data at different epochs. Link perplexity of a set of 
links $L$ is defined as 

\[
\exp\left( - \frac{\sum_{(e_1 \to e_2) \in L} \log \left( \sum_{(z_1, z_2)}
\pi_L^{(z_1, z_2)} \, \beta_{t_l, z_1}^{(e_1)} \, \beta_{t_r, z_2}^{(e_2)} \right)}{|L|} \right)
\tag{13.7}
\]
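
As a hedged illustration, the function below computes link perplexity under the usual reading of the term: the exponentiated negative mean log-likelihood of the held-out links, where the likelihood of a link marginalizes over all topic pairs as in the generative process. The function name and array layouts are assumptions.

```python
import numpy as np

def link_perplexity(links, pi_l, beta_left, beta_right):
    """exp of the negative mean log-likelihood of held-out links.

    pi_l:       K x K distribution over topic pairs.
    beta_left:  K x |E_{t_l}| topic-wise distributions over source entities.
    beta_right: K x |E_{t_r}| topic-wise distributions over target entities.
    """
    log_liks = []
    for e1, e2 in links:
        # p(e1 -> e2) = sum over (z1, z2) of pi_l[z1, z2] * beta_left[z1, e1] * beta_right[z2, e2]
        p = beta_left[:, e1] @ pi_l @ beta_right[:, e2]
        log_liks.append(np.log(p))
    return float(np.exp(-np.mean(log_liks)))


# Uninformative model over two entities: every link has probability 1/4.
uniform_beta = np.full((2, 2), 0.5)
uniform_pi = np.full((2, 2), 0.25)
ppl = link_perplexity([(0, 1), (1, 1), (0, 0)], uniform_pi, uniform_beta, uniform_beta)
```

For the uninformative toy model the perplexity equals the number of possible ordered entity pairs (four), which is the sanity check one would expect; a model that has learned structure should score lower.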


Figure 13.4(a) shows the convergence of the link perplexity using Block-LDA and a baseline 
model on the PPI+SGD dataset with 20% of the full dataset held out for testing. The number of 
topics K is set at 15 since our aim is to recover topics that can be aligned with the 15 protein 
functional categories. $\alpha_D$ and $\alpha_L$ are sampled from Gamma(0.1, 1). It can be observed that the 
Gibbs sampler burns in after about 20 iterations. 

Next, we perform two sets of experiments with the PPI+PubMed Central dataset. The text data 
has three types of entities in each document (words, authors, and protein annotations), with the PPI 
data linking proteins. In the first set of experiments, we evaluate the model using perplexity of 
held-out protein-protein interactions using increasing amounts of the PPI data for training. 


FIGURE 13.4 

Perplexity in the MIPS PPI+SGD dataset: (a) Gibbs sampler convergence (x-axis: iterations); 
(b) gain in perplexity through joint modeling (x-axes: fraction of links used for training, fraction 
of docs used for training). 

All 15,773 documents in the SGD dataset are used when textual information is used. When 
text is not used, the model is equivalent to using only the left half of Figure 13.1. Figures 13.5(a) 
and 13.5(b) show the posterior likelihood of protein-protein interactions recovered using the sparse 
blockmodel and Block-LDA, respectively. In the other set of experiments, we evaluate the model 
using protein perplexity in held-out text using progressively increasing amounts of text as training 
data. All the links in the PPI dataset are used in these experiments when link data are used. When 
link data are not used, the model reduces to Link LDA. In all experiments, the Gibbs sampler is run 
until the held-out perplexity stabilizes to a nearly constant value (≈ 80 iterations). 

Figure 13.4(b) shows the gains in perplexity in the two sets of experiments with different 
amounts of training data. The perplexity values are averaged over 10 runs. In both sets of exper- 
iments, it can be seen that Block-LDA results in lower perplexities than using links/text alone. 
FIGURE 13.5 

Inferred protein-protein interactions. 


These results indicate that co-occurrence patterns of proteins in text contain information about pro- 
tein interactions, which Block-LDA is able to utilize through joint modeling. Our conjecture is that 
the protein co-occurrence information in text is a noisy approximation of the PPI data. 

Table 13.3 shows the top words, proteins, and authors for sample topics induced by running 
Block-LDA over the full PPI+SGD dataset. These topics provide a qualitative feel for the topics 
that emerge using the model. The Gibbs sampling procedure was run until convergence (around 
80 iterations) and the number of topics was set to 15. The topic tables were then analyzed, with a 
title and an analysis of the topic added after the inference procedure was completed. Details about 
proteins and yeast researchers were obtained from the SGD² website to understand the function of 
the top proteins in each topic and to get an idea of the research profile of the top authors mentioned. 

Topic Coherence 

A useful application of latent blockmodeling approaches is understanding the underlying nature of 
data. 

We conduct three different evaluations of the emergent topics. First, we obtain topics from only 
the text corpus using a model that comprises the right half of Figure 13.1, which is equivalent to 
using the Link-LDA model. For the second evaluation, we use the Block-LDA model that is trained 
on the text corpus and the MIPS protein-protein interaction database. Finally, for the third 
evaluation, we replace the MIPS PPI database with the interactions obtained from the wetlab experiments. 
In all cases, we set K, the number of topics, to be 15. In each variant, we represent documents as 
three sets of entities, i.e., the words in the abstracts of the article, the set of proteins associated with 
the article as indicated in the SGD database, and the authors who wrote the article. Each topic there- 
fore consists of three different multinomial distributions over the sets of the three kinds of entities 
described. 

² See http://www.yeastgenome.org. 

Words: mutant, mutants, gene, cerevisiae, growth, type, mutations, saccharomyces, wild, 
mutation, strains, strain, phenotype, genes, deletion 

Proteins: rpl20b, rpl5, rpl16a, rps5, rpl39, rpl18a, rpl27b, rps3, rpl23a, rpl1b, rpl32, rpl17b, 
rpl35a, rpl26b, rpl31a 

Authors: klis_fm, bussey_h, miyakawa_t, toh-e_a, heitman_j, perfect_jr, ohya_y_ws, sherman_f, 
latge_jp, schaffrath_r, duran_a, sa-correia_i, liu_h, subik_j, kikuchi_a, chen_j, goffeau_a, 
tanaka_k, kuchler_k, calderone_r, nombela_c, popolo_l, jablonowski_d, kim_j 

Analysis: A common experimental procedure is to induce random mutations in the “wild-type” 
strain of a model organism (e.g., saccharomyces cerevisiae) and then screen the mutants 
for interesting observable characteristics (i.e., phenotype). Often the phenotype 
shows slower growth rates under certain conditions (e.g., lack of some nutrient). The 
RPL* proteins are all part of the larger (60S) subunit of the ribosome. The research of 
the first two biologists, Klis and Bussey, uses this method. 

(a) Analysis of mutations 

Words: binding, domain, terminal, structure, site, residues, domains, interaction, region, 
subunit, alpha, amino, structural, conserved, atp 

Proteins: rps19b, rps24b, rps3, rps20, rps4a, rps11a, rps2, rps8a, rps10b, rps6a, rps10a, rps19a, 
rps12, rps9b, rps28a 

Authors: naider_f, becker_jm, leulliot_n, van_tilbeurgh_h, melki_r, velours_j, graille_m_s, 
janin_j, zhou_cz, blondeau_k, ballesta_jp, yokoyama_s, bousset_l, vershon_ak, bowler_be, 
zhang_y, arshava_b, buchner_j, wickner_rb, steven_ac, wang_y, zhang_m, forgac_m, 
brethes_d 

Analysis: Protein structure is an important area of study. Proteins are composed of amino-acid 
residues, functionally important protein regions are called domains, and functionally 
important sites are often “conserved” (i.e., many related proteins have the same 
amino acid at the site). The RPS* proteins are all part of the smaller (40S) subunit of the 
ribosome. Naider, Becker, and Leulliot study protein structure. 

(b) Protein structure 

Words: transcription, ii, histone, chromatin, complex, polymerase, transcriptional, rna, 
promoter, binding, dna, silencing, h3, factor, genes 

Proteins: rpl16b, rpl26b, rpl24a, rpl18b, rpl18a, rpl12b, rpl6b, rpp2b, rpl15b, rpl9b, rpl40b, 
rpp2a, rpl20b, rpl14a, rpp0 

Authors: workman_jl, struhl_k, winston_f, buratowski_s, tempst_p, erdjument-bromage_h, 
kornberg_rd_a, svejstrup_jq, peterson_cl, berger_sl, grunstein_m, stillman_dj, cote_j, 
cairns_br, shilatifard_a, hampsey_m, allis_cd, young_ra, thuriaux_p, zhang_z, sternglanz_r, 
krogan_nj, weil_pa, pillus_l 

Analysis: In transcription, DNA is unwound from histone complexes (where it is stored 
compactly) and converted to RNA. This process is controlled by transcription factors, 
which are proteins that bind to regions of DNA called promoters. The RPL* proteins 
are part of the larger subunit of the ribosome, and the RPP proteins are part of the 
ribosome stalk. Many of these proteins bind to RNA. Workman, Struhl, and Winston 
study transcription regulation and the interaction of transcription with the restructuring 
of chromatin (a combination of DNA, histones, and other proteins that comprises 
chromosomes). 

(c) Chromosome remodeling and transcription 

TABLE 13.3 

Top words, proteins, and authors: Topics obtained using Block-LDA on the PPI+SGD dataset. 

Topics that emerge from the different variants can possibly be assigned different indices even 
when they discuss the same semantic concept. To compare topics across variants, we need a method 
to determine which topic indices from the different variants correspond to the same semantic 
concept. To obtain the mapping between topics from each variant, we utilize the Hungarian algorithm 
(Kuhn, 1955) to solve the assignment problem where the cost of aligning topics together is 
determined using the Jensen-Shannon divergence measure. 
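
The alignment step can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem that the Hungarian algorithm addresses. Which of the per-topic distributions (words, proteins, or authors) the authors compared is not stated, so the inputs here are simply two lists of topic distributions over a shared vocabulary; the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def align_topics(topics_a, topics_b):
    """Map each topic index in topics_a to a topic index in topics_b
    so that the total Jensen-Shannon divergence is minimized."""
    cost = np.array([[js_divergence(a, b) for b in topics_b] for a in topics_a])
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))


topics_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
topics_b = [topics_a[2], topics_a[0], topics_a[1]]  # same topics, shuffled indices
mapping = align_topics(topics_a, topics_b)
```

In the toy example the second variant contains the same three topics under shuffled indices, and the mapping recovers the shuffle.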

Once the topics are obtained, we first obtain the proteins associated with the topic by retrieving 
the top proteins from the multinomial distribution corresponding to proteins. Then, the top articles 
corresponding to each topic are obtained using a ranked list of documents with the highest mass of 
their topic proportion distributions ($\theta$) residing in the topic considered. 

Manual Evaluation 

To evaluate the topics, a yeast biologist who is an expert in the field was asked to mark each topic 
with a binary flag indicating if the top words of the distribution represented a coherent sub-topic in 
yeast biology. The top words of the distribution representing a topic were presented as a ranked list 
of words. This process was repeated for the three different variants of the model. The variant used 
to obtain results was concealed from the evaluator to remove the possibility of bias. 

In the next step of the evaluation, the top articles and proteins assigned to each topic were 
presented in a ranked list and a similar judgment was requested to indicate if the article/protein was 
relevant to the topic in question. Similar to the topic coherence judgments, the process was repeated 
for each variant of the model. Screenshots of the tool used for obtaining the judgments can be seen 
in Figure 13.6. It should be noted that since the nature of the topics in the literature considered was 
highly technical and specialized, it was impractical to get judgments from multiple annotators. 

To evaluate the retrieval of the top articles and proteins, we measure the quality of the results by 
computing the precision@10 score. 
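
For reference, precision@10 is the fraction of the ten highest-ranked items that were judged relevant. A minimal sketch, with a hypothetical function name and toy judgments:

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k ranked items that were judged relevant."""
    top = ranked_items[:k]
    return sum(item in relevant for item in top) / k


ranked = list("abcdefghijkl")            # items in ranked order
judged_relevant = {"a", "c", "e", "z"}   # binary relevance judgments
score = precision_at_k(ranked, judged_relevant)
```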

First, we evaluate the coherence of the topics obtained from the three variants described above. 
Table 13.4 shows that out of the 15 topics that were obtained, 12 topics were deemed coherent from 
the text-only model and 13 and 15 topics were deemed coherent from the Block-LDA models using 
the MIPS and wetlab PPI datasets, respectively. 


Variant          Num. Coherent Topics 
---------------  -------------------- 
Only Text        12/15 
Text + MIPS      13/15 
Text + Wetlab    15/15 

TABLE 13.4 

Topic coherence evaluation. 

Next, we study the precision@10 values for each topic and variant of the article retrieval and 
protein retrieval tasks (see Figures 13.7 and 13.8, respectively). The horizontal lines in the plots 
represent the mean of the precision@10 across all topics. It can be seen from the plots that for both 
the article and protein retrieval tasks, on average the joint models work better than the text-only 
model. For the article retrieval task, the model trained with the text + MIPS dataset resulted in the higher 
mean precision@10, whereas for the protein retrieval task, the text + Wetlab PPI dataset returned 
a higher mean precision@10 value. For both the protein retrieval and paper retrieval tasks, the 
improvements shown by the joint models using either of the PPI datasets over the text-only model 
(i.e., the Link LDA model) were statistically significant at the 0.05 level using the paired Wilcoxon 
sign test. However, the difference in performance between the two joint models that used the two 
different PPI networks was not significant, which indicates that there is no observable advantage in 
using one PPI dataset over the other in conjunction with the text corpus. 

Functional Category Prediction 

Proteins are identified as belonging to multiple functional categories in the MIPS PPI dataset, as 
described in Section 13.4. We use Block-LDA and baseline methods to predict proteins’ functional 
categories and evaluate them by comparing them to the ground truth in the MIPS dataset using the 
method presented in prior work (Airoldi et al., 2008). A model is first trained with K set to 15 topics 
to recover the 15 top-level functional categories of proteins. Every topic that is returned consists of a 
set of multinomials including $\beta_{t_l}$, the topic-wise distribution over all proteins. The values of $\beta_{t_l}$ are 
thresholded such that the top ≈16% (the density of the protein-function matrix) of entries are 
considered positive predictions that the protein falls in the functional category corresponding to 
the latent topic. To determine the mapping of latent topics to functional categories, 10% of the proteins 
are used in a procedure that greedily finds the alignment resulting in the best accuracy, as described 
in Airoldi et al. (2008). It is important to note that the true functional categories of proteins are 
completely hidden from the model. The functional categories are used only during evaluation of the 
resultant topics from the model. 

FIGURE 13.6 

Screenshot: Article relevance annotation tool. 

The precision, recall, and $F_1$ scores of the different models in predicting the right functional 
categories for proteins are shown in Table 13.5. Since there are 15 functional categories and a 
protein has approximately 2.5 functional category associations, we expect only ~1/6 of 
protein-functional category associations to be positive. Precision and recall therefore depict a better picture 
of the predictions than accuracy. For the random baseline, every protein-functional category pair is 
randomly deemed to be 0 or 1 with the Bernoulli probability of an association being proportional to 
the ratio of 1s observed in the protein-functional category matrix in the MIPS dataset. In the MMSB 
approach, induced latent blocks are aligned to functional categories as described in Airoldi et al. 
(2008). 
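
The thresholding and scoring steps can be sketched as below. The sketch assumes that topics have already been aligned to functional categories, so that row k of the topic-by-protein matrix corresponds to row k of the ground-truth matrix; the function names and toy matrices are illustrative only.

```python
import numpy as np

def predict_categories(beta_protein, density):
    """Mark the top `density` fraction of entries of a K x P matrix as positive."""
    n_pos = int(round(density * beta_protein.size))
    top = np.argsort(beta_protein, axis=None)[::-1][:n_pos]
    pred = np.zeros(beta_protein.shape, dtype=bool)
    pred.flat[top] = True
    return pred

def precision_recall_f1(pred, truth):
    """Precision, recall, and F1 over all (category, protein) pairs."""
    tp = np.sum(pred & truth)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


beta_protein = np.array([[0.5, 0.3, 0.2],
                         [0.1, 0.1, 0.8]])
truth = np.array([[1, 1, 0],
                  [0, 0, 1]], dtype=bool)
pred = predict_categories(beta_protein, density=1 / 3)
precision, recall, f1 = precision_recall_f1(pred, truth)
```

With only about one sixth of the pairs positive, a predictor that always answers "no" would reach high accuracy, which is why the scores above are more informative than accuracy.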

We see that the $F_1$ scores for the baseline sparse blockmodel and MMSB are nearly the same, 
and that combining text and links provides a significant boost to the $F_1$ score. This suggests that 
protein co-occurrence patterns in the abstracts contain information about functional categories, as 
is also evidenced by the better than random $F_1$ score obtained using Link LDA, which uses only 
documents. All the methods considered outperform the random baseline. 
FIGURE 13.7 

Retrieval performance - Article retrieval. 

FIGURE 13.8 

Retrieval performance - Protein retrieval. 


13.5.2 Results from the Enron Email Corpus Dataset 

As described in Section 13.4, the Enron dataset consists of two components: text from the sent 
folders and the network of senders and recipients of emails within the Enron organization. Each 
email is treated as a document and is annotated with a set of people consisting of the senders and 
recipients of the email. We first study the network reconstruction capability of the Block-LDA model. 
Block-LDA is trained using all 96,103 emails in the sent folders and the 200,404 links obtained 
from the full email corpus. Figures 13.3(a), 13.3(b), and 13.3(c) show the true communication 
matrix, the matrix reconstructed using the sparse mixed membership stochastic blockmodel, and the 
matrix reconstructed using the Block-LDA model, respectively. The figures show that both models 
are approximately able to recover the communication network in the Enron dataset. 

Method               F1      Precision   Recall 
-------------------  ------  ----------  ------ 
Block-LDA            0.249   0.247       0.250 
Sparse Blockmodel    0.161   0.224       0.126 
Link LDA             0.152   0.150       0.155 
MMSB                 0.165   0.166       0.164 
Random               0.145   0.155       0.137 

TABLE 13.5 

Functional category prediction. 



FIGURE 13.9
Enron corpus: Perplexity. Panels: (a) determining number of topics; (b) held-out link perplexity;
(c) held-out people perplexity.

Figure 13.9(a) shows the link perplexity and person perplexity in text of held-out data, as the 
number of topics is varied. Person perplexity is indicative of the surprise inherent in observing a 
sender or a recipient and can be used as a prior in tasks like predicting recipients for emails that are 
being composed. Link perplexity is a score for the quality of link prediction and captures the notion 
of social connectivity in the graph. It indicates how well the model is able to capture links between 
people in the communication network. The person perplexity in the plot decreases initially and 
stabilizes when the number of topics reaches 20. It eventually starts to rise again when the number 
of topics is raised above 40. The link perplexity on the other hand stabilizes at 20 and then exhibits 
a slight downward trend. For the remaining experiments with the Enron data, we set K = 40. 
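
For readers who want to reproduce this kind of curve, the sketch below uses the standard definition of perplexity as the exponential of the negative mean held-out log-probability. This is an assumption: the exact normalization used for link and person perplexity is not restated in this section, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(log_probs):
    """exp(-mean log-probability) over a list of held-out items.

    `log_probs` holds the natural-log probability the model assigns to each
    held-out item (for example, each held-out link or each person mention).
    """
    if not log_probs:
        raise ValueError("need at least one held-out item")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads its mass uniformly over 50 candidates is exactly as
# "surprised" as a fair 50-sided die, so its perplexity is 50.
uniform_50 = perplexity([math.log(1 / 50)] * 10)
```

Under this definition, lower values mean the model assigns higher probability to the held-out links or people, which is how the curves in Figure 13.9 are read.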

In the next set of experiments, we compare Block-LDA with other models on person perplexity
in held-out emails while varying the sizes of the training and test sets. As in the experiments
with the PPI data, the Gibbs sampler is run until the held-out perplexity stabilizes to a nearly
constant value (≈ 80 iterations). The perplexity values are averaged over 10 runs. Figure 13.9(c) shows
the person perplexity in text of held-out data as increasing amounts of the text data are used for
training. The remainder of the dataset is used for testing. It is important to note that only
Block-LDA uses the communication link matrix. A consistent improvement in person perplexity can be
observed when email text data are supplemented with communication link data irrespective of the
training set size. This indicates that the latent block structure in the links is beneficial while shaping
latent topics from text.

Block-LDA is finally evaluated using link prediction. The sparse blockmodel, which serves as a
baseline, does not use any text information. Figure 13.9(b) shows the perplexity in held-out data with 
varying amounts of the 200,404 edges in the network used for training. When textual information is 
used, all 96,103 emails are used. The histogram shows that Block-LDA obtains lower perplexities 
than the sparse blockmodel, which uses only links. As in the PPI experiments, using the text in the 
emails improves the modeling of the network of senders and recipients, although the effect is less 
marked when the number of links used for training is increased. The topical coherence in the latent 
topics induces better latent blocks in the matrix indicating a transfer of signal from the text to the 
network model. 


13.6 Conclusion 

We proposed a model that jointly models entity-entity links and entity-annotated text, permitting
co-occurrence information in text to influence link modeling and vice versa. Our experiments show
that joint modeling outperforms approaches that use only a single source of information.
Improvements are observed when the joint model is evaluated internally using perplexity in two different
datasets and externally using protein functional category prediction in the yeast dataset. We also
evaluated topics obtained from the joint modeling of yeast biology literature and protein-protein
interactions in yeast and compared them to topics that were obtained from using only the literature.
The topics were evaluated for coherence and by measuring the mean precision@10 score of the top
articles and proteins that were retrieved for each topic. Evaluation by a domain expert showed that
the joint modeling produced more coherent topics and showed better precision@10 scores in the
article and protein retrieval tasks, indicating that the model enabled information sharing between the
literature and the PPI networks.


Acknowledgments 


This work was funded by grant 1R101GM081293 from NIH, IIS-0811562 from NSF, and by a gift
from Google. The opinions expressed in this paper are solely those of the authors. 


References 

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). Mixed membership stochastic 
blockmodels. Journal of Machine Learning Research 9: 1981-2014. 

Balasubramanyan, R. and Cohen, W. W. (2011). Block-LDA: Jointly modeling entity-annotated text
and entity-entity links. In Proceedings of the 2011 SIAM Conference on Data Mining (SDM ’11).
SIAM/Omnipress, 450-461.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine
Learning Research 3: 993-1022.

Chang, J. and Blei, D. M. (2009). Relational topic models for document networks. In Proceedings
of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS 2009).
Journal of Machine Learning Research - Proceedings Track 5, 81-88.

Chang, J., Boyd-Graber, J., and Blei, D. M. (2009). Connections between the lines: Augmenting
social networks with text. In Proceedings of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD ’09). New York, NY, USA: ACM, 169-178.

Dietz, L., Bickel, S., and Scheffer, T. (2007). Unsupervised prediction of citation influences. In
Proceedings of the 24th Annual International Conference on Machine Learning (ICML ’07). New
York, NY, USA: ACM, 233-240.

Dwight, S. S., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dolinski, K., Engel, S. R.,
Feierbach, B., Fisk, D. G., Hirschman, J., Hong, E. F., Issel-Tarver, F., Nash, R. S., Sethuraman, A.,
Starr, B., Theesfeld, C. F., Andrada, R., Binkley, G., Dong, Q., Lane, C., Schroeder, M., Weng, S.,
Botstein, D., and Cherry, J. M. (2004). Saccharomyces genome database: Underlying principles
and organisation. Briefings in Bioinformatics 5: 9.

Erosheva, E. A., Fienberg, S. E., and Lafferty, J. D. (2004). Mixed-membership models of scientific 
publications. Proceedings of the National Academy of Sciences 101: 5220. 

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National 
Academy of Sciences 101 Suppl 1: 5228-5235. 

Gruber, A., Rosen-Zvi, M., and Weiss, Y. (2008). Latent topic models for hypertext. In Proceedings
of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008). Corvallis, OR, USA:
AUAI Press, 230-239.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics
Quarterly 2: 83-97.

McCallum, A., Corrada-Emmanuel, A., and Wang, X. (2005). Topic and role discovery in social
networks. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI
’05). IJCAI, 786-791.

Mewes, H.-W., Amid, C., Arnold, R., Frishman, D., Güldener, U., Mannhaupt, G., Münsterkötter, M.,
Pagel, P., Strack, N., Stümpflen, V., Warfsmann, J., and Ruepp, A. (2004). MIPS: Analysis and
annotation of proteins from whole genomes. Nucleic Acids Research 32: 41-44.

Nallapati, R. M., Ahmed, A., Xing, E. P., and Cohen, W. W. (2008). Joint latent topic models for text
and citations. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD ’08). New York, NY, USA: ACM, 542-550.

Parkkinen, J., Sinkkonen, J., Gyenge, A., and Kaski, S. (2009). A block model suitable for sparse
graphs. In Proceedings of the 7th International Workshop on Mining and Learning with Graphs
(MLG 2009). Leuven, Belgium: poster presented.

Rosen-Zvi, M., Griffiths, T. L., Steyvers, M., and Smyth, P. (2004). The author-topic model for
authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial
Intelligence (UAI 2004). Arlington, VA, USA: AUAI Press, 487-494.

Shetty, J. and Adibi, J. (2004). The Enron Email Dataset Database Schema and Brief Statistical
Report. Tech. report. Information Sciences Institute.

Wang, X., Mohanty, N., and McCallum, A. (2006). Group and topic discovery from relations and
their attributes. In Weiss, Y., Schölkopf, B., and Platt, J. (eds). Advances in Neural Information
Processing Systems 18. Cambridge, MA: The MIT Press, 1449-1456.

Wen, Z. and Lin, C.-Y. (2010). Towards finding valuable topics. In Proceedings of the 2010 SIAM
Conference on Data Mining (SDM ’10). Philadelphia, PA, USA: SIAM, 720-731.




14 


Robust Estimation of Topic Summaries Leveraging 
Word Frequency and Exclusivity 


Jonathan M. Bischof 

Department of Statistics, Harvard University, Cambridge, MA 02138, USA 

Edoardo M. Airoldi 

Department of Statistics, Harvard University, Cambridge, MA 02138, USA 


CONTENTS

14.1 Introduction
14.2 A Mixed Membership Model for Poisson Data
    14.2.1 Modeling Word Usage Rates on the Hierarchy
    14.2.2 Modeling the Topic Membership of Documents
    14.2.3 Estimands for Text Analysis
14.3 Scalable Inference via Parallelized HMC Sampler
    14.3.1 A Blocked Gibbs Sampling Strategy
        Updating Tree Parameters
        Updating Topic Affinity Parameters
        Updating Corpus-Level Parameters
    14.3.2 Estimation
    14.3.3 Inference on Missing Document Categories
14.4 Empirical Analysis and Results
    14.4.1 An Overview of the Reuters Corpus
    14.4.2 The Differential Usage Parameters Regulate Topic Exclusivity
    14.4.3 Frequency Modulates the Regularization of Exclusivity
    14.4.4 A Better Two-Dimensional Summary of Semantic Content
    14.4.5 Classification Performance
14.5 Concluding Remarks
    14.5.1 Toward Automated Evaluation of Topic Models
Appendix: Implementing the Parallelized HMC Sampler
References


An ongoing challenge in the analysis of document collections is how to summarize content in terms 
of a set of inferred themes that can be interpreted substantively in terms of topics. However, the cur- 
rent practice in mixed membership models of text (Blei et al., 2003) of parametrizing the themes in 
terms of most frequent words limits interpretability by ignoring the differential use of words across 
topics. Words that are both common and exclusive to a theme are more effective at characterizing 
the topical content of such a theme. We consider a setting where professional editors have annotated 
documents to a collection of topic categories, organized into a tree, in which leaf-nodes correspond 
to the most specific topics. Each document is annotated to multiple categories, at different levels 
of the tree. We introduce hierarchical Poisson convolution (HPC) as a model to analyze annotated 
documents in this setting. The model leverages the structure among categories defined by
professional editors to infer a clear semantic description for each topic in terms of words that are both
frequent and exclusive. We develop a parallelized Hamiltonian Monte Carlo sampler that allows the
inference to scale to millions of documents.


14.1 Introduction 

A recurrent challenge in multivariate statistics is how to construct interpretable low-dimensional 
summaries of high-dimensional data. Historically, simple models based on correlation matrices, 
such as principal component analysis (Jolliffe, 1986) and canonical correlation analysis (Hotelling, 
1936) have proven to be effective tools for data reduction. More recently, multilevel models have be- 
come a flexible and powerful tool for finding latent structure in high-dimensional data (McLachlan 
and Peel, 2000; Sohn and Xing, 2009; Blei et al., 2003; Airoldi et al., 2008). However, while inter- 
pretable statistical summaries are highly valued in applications, dimensionality reduction models are 
rarely optimized to aid qualitative discovery; there is no guarantee that the optimal low-dimensional 
projections will be understandable in terms of quantities of scientific interest that can help practi- 
tioners make decisions. Here, we design a model that targets scientific estimands of interest in text 
analysis and achieves a good balance between interpretability and dimensionality reduction. 

We consider a setting in which we observe two sets of categorical data for each unit of
observation: $w_{1:V}$, which live in a high-dimensional space, and $l_{1:K}$, which live in a structured
low-dimensional space and provide a direct link to information of scientific interest about the sampling
units. The goal of the analysis is twofold. First, we desire to develop a joint model for the
observations $Y = \{W_{D \times V}, L_{D \times K}\}$ that can be used to project the data onto a low-dimensional
parameter space $\Theta$ in which interpretability is maintained by mapping categories in $\mathcal{L}$ to
directions in $\Theta$. Second, we would like the mapping from the original space to the low-dimensional
projection to be scientifically interesting so that statistical insights about $\Theta$ can be understood in
terms of the original inputs, $w_{1:V}$, in a way that guides future research.

In the application to text analysis that motivates this work, $w_{1:V}$ are the raw word counts
observed in each document and $l_{1:K}$ are a set of labels created by professional editors that are
indicative of topical content. Specifically, the words are represented as an unordered vector of counts, with
the length of the vector corresponding to the size of a known dictionary. The labels are organized
in a tree-structured ontology, from the most generic topic at the root of the tree to the most specific
topic at the leaves. Each news article may be annotated with more than one label, at the editors’
discretion. The number of labels is given by the size of the ontology and typically ranges from tens
to hundreds of categories. In this context, the inferential challenge is to discover a low-dimensional
representation of topical content, $\Theta$, that aligns with the coarse labels provided by editors while at
the same time providing a mapping between the textual content and directions in $\Theta$ in a way that
formalizes and enhances our understanding of how low-dimensional structure is expressed in the
space of observed words.

Recent work on this problem in the machine learning literature has taken a Bayesian
hierarchical approach, viewing a document’s content as arising from a mixture of
component distributions, commonly referred to as “topics,” as they often capture thematic structure
(Blei, 2012). As the component distributions are almost exclusively parameterized as multinomial
distributions over words in the vocabulary, the loading of words onto topics is characterized in
terms of the relative frequency of within-component usage. While relative frequency has proven
to be a useful mapping of topical content onto words, recent work has documented a growing list
of interpretability issues with frequency-based summaries: they are often dominated by contentless
“stop” words (Wallach et al., 2009), sometimes appear incoherent or redundant (Mimno et al., 2011;
Chang et al., 2009), and typically require post hoc modification to meet human expectations (Hu
et al., 2011). Instead, we propose a new mapping for topical content that incorporates how words
are used differentially across topics. If a word is common in a topic, it is also important to know
whether it is common in many topics or relatively exclusive to the topic in question. Both of these 
summary statistics are informative: nonexclusive words are less likely to carry topic-specific con- 
tent, while infrequent words occur too rarely to form the semantic core of a topic. We therefore 
look for the most frequent words in the corpus that are also likely to have been generated from 
the topic of interest to summarize its content. In this approach we borrow ideas from the statistical 
literature, in which models of differential word usage have been leveraged for analyzing writing 
styles in a supervised setting (Mosteller and Wallace, 1984; Airoldi et al., 2005; 2006; 2007a), and 
combine them with ideas from the machine learning literature, in which latent variable and mixture 
models based on frequent word usage have been used to infer structure that often captures topical 
content (McCallum et al., 1998; Blei et al., 2003; Canny, 2004; Airoldi et al., 2007b; 2010a). From 
a statistical perspective, models based on topic-specific distributions over the vocabulary cannot 
produce stable estimates of differential usage since they only model the relative frequency of words 
within topics. They cannot regularize usage across topics and naively infer the greatest differential 
usage for the rarest features (Eisenstein et al., 2011). To tackle this issue, we introduce the gener- 
ative framework of hierarchical Poisson convolution (HPC) that parameterizes topic-specific word 
counts as unnormalized count variates whose rates can be regularized across topics as well as within 
them, making stable inference of both word frequency and exclusivity possible. HPC can be seen 
as a fully generative extension of sparse topic coding (Zhu and Xing, 2011) that emphasizes reg- 
ularization and interpretability rather than exact sparsity. Additionally, HPC leverages hierarchical 
systems of topic categories created by professional editors in collections such as Reuters, the New 
York Times, Wikipedia, and Encyclopedia Britannica to make focused comparisons of differential 
use between neighboring topics on the tree and build a sophisticated joint model for topic member- 
ships and labels in the documents. By conditioning on a known hierarchy, we avoid the complicated 
tasks of inferring hierarchical structure (Blei et al., 2004; Mimno et al., 2007; Adams et al., 2010) 
as well as the number of topics (Joutard et al., 2008; Airoldi et al., 2010a;b). We introduce a par- 
allelized Hamiltonian Monte Carlo (HMC) estimation strategy that makes full Bayesian inference 
efficient and scalable. 

Since the proposed model is designed to infer an interpretable description of human-generated 
labels, we restrict the topic components to have a one-to-one correspondence with the human- 
generated labels, as in Labeled LDA (Ramage et al., 2009). This descriptive link between the labels 
and topics differs from the predictive link used in Supervised LDA (Blei and McAuliffe, 2007; 
Perotte et al., 2012), where topics are learned as an optimal covariate space to predict an observed 
document label or response variable. The more restrictive descriptive link can be expected to limit 
predictive power but is crucial for learning summaries of individual labels. We then infer a descrip- 
tion of these labels in terms of words that are both frequent and exclusive. We anticipate that learning 
a concise semantic description for any collection of topics implicitly defined by professional editors 
is the first step toward the semi-automated creation of domain-specific topic ontologies. Domain- 
specific topic ontologies may be useful for evaluating the semantic content of inferred topics, or for 
predicting the semantic content of new social media, including Twitter messages and Facebook wall 
posts. 


14.2 A Mixed Membership Model for Poisson Data 

The hierarchical Poisson convolution model is a data generating process for document collections
whose topics are organized in a hierarchy, and whose topic labels are observed. We refer to the
structure among topics interchangeably as a hierarchy or tree since we assume that each topic has exactly
one parent and that no cyclical parental relations are allowed. Each document $d \in \{1, \ldots, D\}$ is
a record of counts $w_{fd}$ for every feature in the vocabulary, $f \in \{1, \ldots, V\}$. The length of the
document is given by $L_d$, which we normalize by the average document length $\bar{L}$ to get
$l_d = \frac{1}{\bar{L}} L_d$. Documents have unrestricted membership to any combination of topics
$k \in \{1, \ldots, K\}$ represented by a vector of labels $I_d$ where
$I_{dk} = \mathbb{I}\{\text{doc } d \text{ belongs to topic } k\}$.

14.2.1 Modeling Word Usage Rates on the Hierarchy 

The HPC model leverages the known topic hierarchy by assuming that words are used similarly in
neighboring topics. Specifically, the log rate for a word across topics follows a Gaussian diffusion
down the tree. Consider the topic hierarchy presented in the right panel of Figure 14.1. At the top
level, $\mu_{f,0}$ represents the log rate for feature $f$ overall in the corpus. The log rates
$\mu_{f,1}, \ldots, \mu_{f,J}$ for first-level topics are then drawn from a Gaussian centered around the corpus
rate with dispersion controlled by the variance parameter $\tau^2_{f,0}$. From first-level topics, we then draw
the log rates for the second-level topics from another Gaussian centered around their mean $\mu_{f,j}$
and with variance $\tau^2_{f,j}$. This process is continued down the tree, with each parent node having a
separate variance parameter to control the dispersion of its children.

The variance parameters $\tau^2_{f,p}$ directly control the local differential expression in a branch of the
tree. Words with high variance parameters can have rates in the child topics that differ greatly from
the parent topic $p$, allowing the child rates to diverge. Words with low variance parameters will
have rates close to the parent and so will be expressed similarly among the children. If we learn a
population distribution for the $\tau^2_{f,p}$ that has low mean and variance, it is equivalent to saying that
most features are expressed similarly across topics a priori and that we would need a preponderance
of evidence to believe otherwise.

14.2.2 Modeling the Topic Membership of Documents 

Documents in the HPC model can contain content from any of the $K$ topics in the hierarchy at
varying proportions, with the exact allocation given by the vector $\theta_d$ on the $K - 1$ simplex. The
model assumes that the count for word $f$ contributed by each topic follows a Poisson distribution
whose rate is moderated by the document’s length and membership to the topic; that is,
$w_{fdk} \sim \text{Pois}(l_d \theta_{dk} \beta_{fk})$. The only data we observe is the total word count
$w_{fd} = \sum_{k=1}^{K} w_{fdk}$, but the infinite divisibility property of the Poisson distribution gives us that
$w_{fd} \sim \text{Pois}(l_d \theta_d^T \beta_f)$. These draws are done for every word in the vocabulary (using the
same $\theta_d$) to get the content of the document.$^1$

FIGURE 14.1
Graphical representation of hierarchical Poisson convolution (left panel) and detail on tree plate
(right panel). For an introduction to this type of illustration, see Airoldi (2007).

In labeled document collections, human coders give us an extra piece of information for each
document, $I_d$, that indicates the set of topics that contributed its content. As a result, we know
$\theta_{dk} = 0$ for all topics $k$ where $I_{dk} = 0$, and only have to determine how content is allocated
between the set of active topics.

The HPC model assumes that these two sources of information for a document are not
generated independently. A document should not have a high probability of being labeled to a topic from
which it receives little content and vice versa. Instead, the model posits a latent $K$-dimensional
topic affinity vector $\xi_d \sim \mathcal{N}(\eta, \Sigma)$ that expresses how strongly the document is associated
with each topic. The topic memberships and labels of the document are different manifestations
of this affinity. Specifically, each $\xi_{dk}$ is the log odds that topic label $k$ is active in the
document, with $I_{dk} \sim \text{Bernoulli}(\text{logit}^{-1}(\xi_{dk}))$. Conditional on the labels, the topic memberships are
the relative sizes of the document’s affinity for the active topics and zero for inactive topics:
$\theta_{dk} = e^{\xi_{dk}} I_{dk} / \sum_{j=1}^{K} e^{\xi_{dj}} I_{dj}$. Restricting each document’s membership vectors to the
labeled topics is a natural and efficient way to generate sparsity in the mixing parameters, stabilizing inference
and reducing the computational burden of posterior simulation.
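
The mapping from affinities to label probabilities and memberships described above can be sketched in a few lines of NumPy. The function names `label_probabilities` and `memberships` are hypothetical, and the max-shift inside the exponent is a numerical-stability device added here, not something stated in the text.

```python
import numpy as np


def label_probabilities(xi):
    """Inverse logit of the affinities: P(I_dk = 1) for each topic k."""
    xi = np.asarray(xi, dtype=float)
    return 1.0 / (1.0 + np.exp(-xi))


def memberships(xi, labels):
    """theta_dk = exp(xi_dk) I_dk / sum_j exp(xi_dj) I_dj for one document."""
    xi = np.asarray(xi, dtype=float)
    active = np.asarray(labels).astype(bool)
    if not active.any():
        raise ValueError("document has no active topic labels")
    theta = np.zeros_like(xi)
    # Shifting by the largest active affinity leaves the ratio unchanged
    # but avoids overflow in exp().
    shifted = np.exp(xi[active] - xi[active].max())
    theta[active] = shifted / shifted.sum()
    return theta


# Three topics, the third one inactive: its large affinity is ignored.
theta_example = memberships([0.0, np.log(3.0), 5.0], [1, 1, 0])
```

In this example the two active topics receive memberships 0.25 and 0.75, and the unlabeled topic receives exactly zero regardless of its affinity.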

We outline the generative process in full detail in Table 14.1, which can be summarized in three 
steps. First, a set of rate and variance parameters are drawn for each feature in the vocabulary. 
Second, a topic affinity vector is drawn for each document in the corpus, which generates the topic
labels. Finally, both sets of parameters are then used to generate the words in each document. For
simplicity of presentation we assume that each non-terminal node has J children and that the tree 
has only two levels below the corpus level, but the model can accommodate any tree structure. 
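
The three steps summarized above can be simulated directly. The following is a minimal sketch, not the authors' code: it assumes $K = J + J^2$ topics (the $J$ first-level topics plus $J$ children of each), draws the observed counts in one step using the Poisson convolution property rather than summing per-topic counts, and uses illustrative hyperparameter defaults. The function names are hypothetical.

```python
import numpy as np


def scaled_inv_chi2(rng, nu, sigma2, size=None):
    """Draw from Scaled Inv-chi^2(nu, sigma2) as nu * sigma2 / chi2(nu)."""
    return nu * sigma2 / rng.chisquare(nu, size=size)


def simulate_hpc(V=50, D=20, J=3, psi=0.0, gamma2=1.0, nu=5.0, sigma2=0.25,
                 eta=-1.0, lam2=1.0, upsilon=100.0, seed=0):
    rng = np.random.default_rng(seed)
    K = J + J * J  # assumption: first-level topics plus all their children

    # Step 1: tree parameters, one independent draw per feature f.
    mu0 = rng.normal(psi, np.sqrt(gamma2), size=V)
    tau2_0 = scaled_inv_chi2(rng, nu, sigma2, size=V)
    mu1 = rng.normal(mu0[:, None], np.sqrt(tau2_0)[:, None], size=(V, J))
    tau2_1 = scaled_inv_chi2(rng, nu, sigma2, size=(V, J))
    mu2 = rng.normal(mu1[:, :, None], np.sqrt(tau2_1)[:, :, None],
                     size=(V, J, J))
    mu = np.concatenate([mu1, mu2.reshape(V, J * J)], axis=1)  # V x K
    beta = np.exp(mu)

    # Step 2: topic affinities, labels, and memberships per document.
    xi = rng.normal(eta, np.sqrt(lam2), size=(D, K))
    labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-xi)))
    weights = np.exp(xi) * labels
    totals = weights.sum(axis=1, keepdims=True)
    # Documents with no active label get an all-zero membership vector.
    theta = np.divide(weights, totals, out=np.zeros_like(weights),
                      where=totals > 0)

    # Step 3: lengths and word counts, w_fd ~ Pois(l_d * theta_d' beta_f).
    raw_length = rng.poisson(upsilon, size=D)
    l = raw_length / raw_length.mean()
    counts = rng.poisson(l[:, None] * (theta @ beta.T))  # D x V
    return {"mu": mu, "xi": xi, "labels": labels, "theta": theta,
            "l": l, "counts": counts}


sim = simulate_hpc(seed=1)
```

Simulated data of this kind is mainly useful for checking a sampler: parameters recovered from `sim["counts"]` and `sim["labels"]` can be compared with the known `mu` and `xi`.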

14.2.3 Estimands for Text Analysis 

In order to measure topical semantic content, we consider the topic-specific frequency and exclu- 
sivity of each word in the vocabulary. These quantities form a two-dimensional summary of each 
word’s relation to a topic of interest, with higher scores in both frequency and exclusivity being pos- 
itively related to topic specific content. Additionally, we develop a univariate summary of semantic 
content that can be used to rank words in terms of their semantic content. These estimands are sim- 
ple functions of the rate parameters of HPC; the distribution of the documents’ topic memberships is 
a nuisance parameter needed to disambiguate the content of a document between its labeled topics. 

A word’s topic-specific frequency, $\beta_{f,k} = \exp \mu_{f,k}$, is directly parameterized in the model and
is regularized across words (via hyperparameters $\psi$ and $\gamma^2$) and across topics. A word’s exclusivity
to a topic is its usage rate relative to a set of comparison topics $S$:
$\phi_{f,k} = \beta_{f,k} / \sum_{j \in S} \beta_{f,j}$.
A topic’s siblings are a natural choice for a comparison set to see which words are overexpressed
in the topic compared to a set of similar topics. While not directly modeled in HPC, the exclusivity
parameters are also regularized by the $\tau^2_{f,p}$, since if the child rates are forced to be similar then the
$\phi_{f,k}$ will be pushed toward a baseline value of $1/|S|$. We explore the regularization structure of the
model empirically in Section 14.4.
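
To make the exclusivity estimand concrete, the sketch below computes it from a matrix of log rates. It assumes the comparison set $S$ contains the topic itself together with its siblings, which is what makes $1/|S|$ the baseline when all rates are equal; the function name `exclusivity` and the toy inputs are illustrative.

```python
import numpy as np


def exclusivity(mu, topic, comparison):
    """phi_{f,topic} = beta_{f,topic} / sum_{j in comparison} beta_{f,j}.

    mu is a V x K array of log rates; `comparison` lists the column indices
    in S (assumed here to include `topic` itself). The ratio is computed on
    the log scale to avoid overflow.
    """
    mu = np.asarray(mu, dtype=float)
    log_beta_s = mu[:, comparison]
    peak = log_beta_s.max(axis=1)
    log_denominator = peak + np.log(np.exp(log_beta_s - peak[:, None]).sum(axis=1))
    return np.exp(mu[:, topic] - log_denominator)


# Word 0 is used equally in three sibling topics; word 1 is used twice as
# often in topic 0 as in either sibling.
mu_toy = np.log(np.array([[1.0, 1.0, 1.0],
                          [2.0, 1.0, 1.0]]))
phi_toy = exclusivity(mu_toy, topic=0, comparison=[0, 1, 2])
```

Here the first word sits at the baseline of 1/3, while the second reaches 0.5 because half of its total sibling-level usage falls in topic 0.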

$^1$ This is where the model’s name arises: the observed feature count in each document is the
convolution of (unobserved) topic-specific Poisson variates.

TABLE 14.1
Generative process for hierarchical Poisson convolution.

Step: Tree parameters
For feature $f \in \{1, \ldots, V\}$:
  • Draw $\mu_{f,0} \sim \mathcal{N}(\psi, \gamma^2)$
  • Draw $\tau^2_{f,0} \sim \text{Scaled Inv-}\chi^2(\nu, \sigma^2)$
  • For $j \in \{1, \ldots, J\}$ (first level of hierarchy):
      - Draw $\mu_{f,j} \sim \mathcal{N}(\mu_{f,0}, \tau^2_{f,0})$
      - Draw $\tau^2_{f,j} \sim \text{Scaled Inv-}\chi^2(\nu, \sigma^2)$
  • For $j \in \{1, \ldots, J\}$ (terminal level of hierarchy):
      - Draw $\mu_{f,j1}, \ldots, \mu_{f,jJ} \sim \mathcal{N}(\mu_{f,j}, \tau^2_{f,j})$
  • Define $\beta_{f,k} \equiv e^{\mu_{f,k}}$ for $k \in \{1, \ldots, K\}$

Step: Topic membership parameters
For document $d \in \{1, \ldots, D\}$:
  • Draw $\xi_d \sim \mathcal{N}(\eta, \Sigma = \lambda^2 I_K)$
  • For topic $k \in \{1, \ldots, K\}$:
      - Define $p_{dk} \equiv 1/(1 + e^{-\xi_{dk}})$
      - Draw $I_{dk} \sim \text{Bernoulli}(p_{dk})$
      - Define $\theta_{dk}(I_d, \xi_d) \equiv e^{\xi_{dk}} I_{dk} / \sum_{j=1}^{K} e^{\xi_{dj}} I_{dj}$

Step: Data generation
For document $d \in \{1, \ldots, D\}$:
  • Draw normalized document length $l_d \sim \frac{1}{\bar{L}} \text{Pois}(\upsilon)$
  • For every topic $k$ and feature $f$:
      - Draw count $w_{fdk} \sim \text{Pois}(l_d \theta_{dk} \beta_{fk})$
  • Define $w_{fd} \equiv \sum_{k=1}^{K} w_{fdk}$ (observed data)

Since both frequency and exclusivity are important factors in determining a word’s semantic
content, a univariate measure of topical importance is a useful estimand for diverse tasks such as
dimensionality reduction, feature selection, and content discovery. In constructing a composite
measure, we do not want a high rank in one dimension to be able to compensate for a low rank in the
other, since frequency or exclusivity alone are not necessarily useful. We therefore adopt the
harmonic mean to pull the “average” rank toward the lower score. For word $f$ in topic $k$, we define the
$\text{FREX}_{fk}$ score as the harmonic mean of the word’s rank in the distribution of $\phi_{\cdot,k}$ and
$\mu_{\cdot,k}$:

$$\text{FREX}_{fk} = \left( \frac{w}{\text{ECDF}_{\phi_{\cdot,k}}(\phi_{f,k})} + \frac{1 - w}{\text{ECDF}_{\mu_{\cdot,k}}(\mu_{f,k})} \right)^{-1},$$

where $w$ is the weight for exclusivity (which we set to 0.5 as a default) and $\text{ECDF}_{x_{\cdot,k}}$ is the
empirical cdf function applied to the values $x$ over the first index.
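
The FREX definition above translates into a short function. This is a sketch under the assumption that the empirical cdf is evaluated as the fraction of words with a value less than or equal to the word's own value; the names `ecdf_values` and `frex` are hypothetical.

```python
import numpy as np


def ecdf_values(x):
    """Empirical cdf of x evaluated at each of its own entries."""
    x = np.asarray(x, dtype=float)
    # searchsorted(..., side="right") counts the entries <= each value.
    return np.searchsorted(np.sort(x), x, side="right") / x.size


def frex(phi_k, mu_k, w=0.5):
    """Weighted harmonic mean of exclusivity rank and frequency rank.

    phi_k and mu_k hold the exclusivity and log frequency of every word in
    one topic; w is the weight placed on exclusivity.
    """
    rank_exclusivity = ecdf_values(phi_k)
    rank_frequency = ecdf_values(mu_k)
    return 1.0 / (w / rank_exclusivity + (1.0 - w) / rank_frequency)


# Word 0: frequent but not exclusive; word 2: exclusive but rare;
# word 1: moderately high on both dimensions.
frex_toy = frex(phi_k=[0.1, 0.5, 0.9], mu_k=[3.0, 2.0, 1.0])
```

In the toy example the balanced word scores 2/3 while the two one-sided words score only 0.5, which illustrates how the harmonic mean prevents a high rank on one dimension from compensating for a low rank on the other.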



14.3 Scalable Inference via Parallelized HMC Sampler 

We use a Gibbs sampler to obtain the posterior expectations of the unknown rate and membership
parameters (and associated hyperparameters) given the observed data. Specifically, inference is
conditioned on $W$, a $D \times V$ matrix of word counts; $I$, a $D \times K$ matrix of topic labels; $l$, a $D$-vector
of document lengths; and $\mathcal{T}$, a tree structure for the topics.

Creating a scalable inference method is critical since the space of latent variables grows
linearly in the number of words and documents, with $K(D + V)$ total unknowns. Our model offers
an advantage in that the parameters can be conceptually organized into two subsets, and the
posterior distribution of each subset of parameters factors nicely, conditionally on the other subset of
parameters. On one side, the conditional posterior of the rate and variance parameters
$\{\mu_f, \tau_f^2\}_{f=1}^{V}$ factors by word given the membership parameters and the hyperparameters $\psi$,
$\gamma^2$, $\nu$, and $\sigma^2$. On the other, the conditional posterior of the topic affinity parameters
$\{\xi_d\}_{d=1}^{D}$ factors by document given the hyperparameters $\eta$ and $\Sigma$ and the rate parameters
$\{\mu_f\}_{f=1}^{V}$.

Therefore, conditional on the hyperparameters, we are left with two blocks of draws that can 
be broken into V or D independent threads. Using parallel computing software such as message 
passing interface (MPI), the computation time for drawing the parameters in each block is only 
constrained by resources required for a single draw. The total runtime need not significantly in- 
crease with the addition of more documents or words as long as the number of available cores 
also increases. Both of these conditional distributions are only known up to a constant and can be 
high-dimensional if there are many topics, making direct sampling impossible and random walk 
Metropolis inefficient. We are able to obtain uncorrelated draws through the use of Hamiltonian
Monte Carlo (HMC) (Neal, 2011), which leverages the posterior gradient and Hessian to find a
distant point in the parameter space with high probability of acceptance. HMC works well for log
densities that are unimodal and have relatively constant curvature. We give step-by-step instructions 
for our implementation of the algorithm in the Appendix. 
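
For orientation, a single generic HMC transition can be sketched as below. This is a textbook leapfrog update with an identity mass matrix that uses only the gradient; it is not the authors' implementation (which is described in the Appendix and also draws on curvature information), and the function name, step size, and trajectory length are illustrative assumptions.

```python
import numpy as np


def hmc_step(x, log_density, grad_log_density, step_size, n_leapfrog, rng):
    """One HMC transition targeting exp(log_density); returns (state, accepted)."""
    x = np.asarray(x, dtype=float)
    momentum = rng.standard_normal(x.shape)
    current_energy = -log_density(x) + 0.5 * momentum @ momentum

    # Leapfrog integration: half step in momentum, alternating full steps,
    # and a final half step in momentum.
    x_new = x.copy()
    p = momentum + 0.5 * step_size * grad_log_density(x_new)
    for i in range(n_leapfrog):
        x_new = x_new + step_size * p
        if i < n_leapfrog - 1:
            p = p + step_size * grad_log_density(x_new)
    p = p + 0.5 * step_size * grad_log_density(x_new)

    proposed_energy = -log_density(x_new) + 0.5 * p @ p
    # Metropolis correction for the discretization error of the integrator.
    if np.log(rng.uniform()) < current_energy - proposed_energy:
        return x_new, True
    return x, False


# Toy target: a two-dimensional standard normal.
rng = np.random.default_rng(0)
state = np.zeros(2)
draws, n_accepted = [], 0
for _ in range(2000):
    state, accepted = hmc_step(state, lambda z: -0.5 * z @ z, lambda z: -z,
                               step_size=0.2, n_leapfrog=10, rng=rng)
    draws.append(state)
    n_accepted += accepted
draws = np.array(draws)
```

In the blocked scheme described here, one such transition would be applied independently to each word's rate vector or each document's affinity vector, which is what makes the draws easy to distribute across cores.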

After appropriate initialization, we follow a fixed Gibbs scan where the two blocks of latent 
variables are drawn in parallel from their conditional posteriors using HMC. We then draw the 
hyperparameters conditional on all the imputed latent variables.


14.3.1 A Blocked Gibbs Sampling Strategy 

To set up the block Gibbs sampling algorithm, we derive the relevant conditional posterior distribu- 
tions and explain how we sample from each. 


Updating Tree Parameters 

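In the first block, the conditional posterior of the tree parameters factors by word:

$$p(\{\mu_f, \tau_f^2\}_{f=1}^{V} \mid W, I, l, \psi, \gamma^2, \nu, \sigma^2, \{\xi_d\}_{d=1}^{D}, \mathcal{T}) \propto
\prod_{f=1}^{V} \left\{ \prod_{d=1}^{D} p(w_{fd} \mid I_d, l_d, \mu_f, \xi_d) \right\} \cdot p(\mu_f, \tau_f^2 \mid \psi, \gamma^2, \mathcal{T}, \nu, \sigma^2).$$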


Given the conditional conjugacy of the variance parameters and their strong influence on the curvature 
of the rate parameter posterior, we sample the two groups conditional on each other to optimize 
HMC performance. Conditioning on the variance parameters, we can write the likelihood of the rate 
parameters as a Poisson regression where the documents are observations, the $\theta_d(I_d, \xi_d)$ are the 
covariates, and the $l_d$ serve as exposure weights. 

The prior distribution of the rate parameters is a Gaussian graphical model, so a priori the log 
rates for each word are jointly Gaussian with mean $\psi \mathbf{1}$ and precision matrix 
$\Lambda(\gamma^2, \tau_f^2, T)$, which has non-zero entries only for topic pairs that have a direct 
parent-child relationship. 2 The log conditional posterior is: 

\[
\log p(\mu_f \mid W, I, l, \tau_f^2, \psi, \gamma^2, \nu, \sigma^2, \{\xi_d\}_{d=1}^{D}, T) =
- \sum_{d=1}^{D} l_d\, \theta_d^{T} \beta_f + \sum_{d=1}^{D} w_{fd} \log(\theta_d^{T} \beta_f)
- \frac{1}{2} (\mu_f - \psi \mathbf{1})^{T} \Lambda\, (\mu_f - \psi \mathbf{1}).
\]

2 In practice this precision matrix can be found easily as the negative Hessian of the log-prior distribution. 

We use HMC to sample from this unnormalized density. Note that the covariate matrix $\Theta_{D \times K}$ is 
very sparse in most cases, so we speed computation with a sparse matrix representation. 
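
The log conditional posterior above and its gradient, which HMC needs, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes $\beta_f = \exp(\mu_f)$ elementwise (since the $\mu_f$ are log rates), and the function and argument names are hypothetical. A dense array is used for `Theta`; a sparse matrix type that supports the `@` operator could be substituted.

```python
import numpy as np

def log_post_mu(mu, w, ell, Theta, psi, Lam):
    """Unnormalized log conditional posterior of the log rates mu (length K) for one word.

    w: word counts per document (D,); ell: document lengths (D,);
    Theta: topic proportions (D, K); psi: scalar prior mean; Lam: prior precision (K, K).
    """
    beta = np.exp(mu)                # assumption: rates are exponentiated log rates
    lin = Theta @ beta               # theta_d^T beta_f for every document
    resid = mu - psi
    return -ell @ lin + w @ np.log(lin) - 0.5 * resid @ (Lam @ resid)

def grad_log_post_mu(mu, w, ell, Theta, psi, Lam):
    """Gradient of log_post_mu with respect to mu."""
    beta = np.exp(mu)
    lin = Theta @ beta
    # chain rule through beta = exp(mu), plus the Gaussian prior term
    return beta * (Theta.T @ (w / lin - ell)) - Lam @ (mu - psi)
```

A finite-difference comparison against `log_post_mu` is a cheap way to validate the gradient before handing both functions to an HMC routine.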

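We know the conditional distribution of the variance parameters due to the conjugacy of the 
inverse-$\chi^2$ prior with the normal distribution of the log rates. Specifically, if $\mathcal{C}(T)$ is the set of child 
topics of topic $k$ with cardinality $J$, then 

\[
\tau_{fk}^2 \mid \mu_f, \nu, \sigma^2, T \sim \text{Inv-}\chi^2\!\left( J + \nu,\;
\frac{\nu \sigma^2 + \sum_{j \in \mathcal{C}(T)} (\mu_{fj} - \mu_{fk})^2}{J + \nu} \right).
\]
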


Updating Topic Affinity Parameters 

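In the second block, the conditional posterior of the topic affinity vectors factors by document: 

\[
p\big(\{\xi_d\}_{d=1}^{D} \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma\big) \propto
\prod_{d=1}^{D} \Big\{ \prod_{f=1}^{V} p(w_{fd} \mid I_d, l_d, \mu_f, \xi_d) \Big\} \cdot p(I_d \mid \xi_d) \cdot p(\xi_d \mid \eta, \Sigma).
\]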


We can again write the likelihood as a Poisson regression, now with the rates as covariates. The log 
conditional posterior for one document is: 

\[
\log p(\xi_d \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma) =
- l_d \sum_{f=1}^{V} \beta_f^{T} \theta_d + \sum_{f=1}^{V} w_{fd} \log(\beta_f^{T} \theta_d)
- \sum_{k=1}^{K} \log(1 + e^{-\xi_{dk}})
- \sum_{k=1}^{K} (1 - I_{dk})\, \xi_{dk} - \frac{1}{2} (\xi_d - \eta)^{T} \Sigma^{-1} (\xi_d - \eta).
\]

We use HMC to sample from this unnormalized density. Here the parameter vector $\theta_d$ is sparse 
rather than the covariate matrix $B_{V \times K}$. If we remove the entries of $\theta_d$ and columns of $B$ pertaining 
to topics $k$ where $I_{dk} = 0$, then we are left with a low-dimensional regression where only the active 
topics are used as covariates, greatly simplifying computation. 
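
The reduction to active topics can be illustrated with a short sketch. It assumes, following the model definition earlier in the chapter, that $\theta_{dk} \propto e^{\xi_{dk}} I_{dk}$, so inactive topics receive zero weight; the function names are hypothetical, and the Poisson log-likelihood is written up to an additive constant.

```python
import numpy as np

def theta_from_affinity(xi, I):
    """Topic proportions: softmax of the affinities restricted to active topics (I == 1)."""
    mask = np.asarray(I).astype(bool)
    weights = np.zeros_like(xi, dtype=float)
    weights[mask] = np.exp(xi[mask] - xi[mask].max())   # subtract the max for numerical stability
    return weights / weights.sum()

def poisson_loglik_full(w, ell, B, theta):
    """Poisson log-likelihood (up to a constant) using all K columns of the rate matrix B."""
    rate = ell * (B @ theta)
    return np.sum(w * np.log(rate) - rate)

def poisson_loglik_active(w, ell, B, xi, I):
    """Same quantity computed only from the columns of B for the active topics."""
    active = np.flatnonzero(I)
    theta_a = theta_from_affinity(xi[active], np.ones(active.size))
    rate = ell * (B[:, active] @ theta_a)
    return np.sum(w * np.log(rate) - rate)
```

Both functions return the same value, but the second works with a $V \times |\text{active}|$ matrix, which is where the saving comes from when documents carry only a few labels.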


Updating Corpus-Level Parameters 

We draw the hyperparameters after each iteration of the block update. We put flat priors on these 
unknowns so that we can learn their most likely values from the data. As a result, their conditional 
posteriors only depend on the latent variables they generate. 

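The log corpus-level rates $\mu_{f,0}$ for each word follow a Gaussian distribution with mean $\psi$ and 
variance $\gamma^2$. The conditional distribution of these hyperparameters is available in closed form: 

\[
\psi \mid \gamma^2, \{\mu_{f,0}\}_{f=1}^{V} \sim \mathcal{N}\!\left( \frac{1}{V} \sum_{f=1}^{V} \mu_{f,0},\; \frac{\gamma^2}{V} \right)
\quad \text{and} \quad
\gamma^2 \mid \psi, \{\mu_{f,0}\}_{f=1}^{V} \sim \text{Inv-}\chi^2\!\left( V,\; \frac{1}{V} \sum_{f=1}^{V} (\mu_{f,0} - \psi)^2 \right).
\]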
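
These two closed-form conditionals translate directly into alternating draws. The sketch below is illustrative only; the function names are hypothetical, and the scaled inverse-$\chi^2$ draw uses the identity $\mathrm{df} \cdot s^2 / \chi^2_{\mathrm{df}}$.

```python
import numpy as np

def draw_psi(rng, mu0, gamma2):
    """psi | gamma2, mu0 ~ Normal(mean(mu0), gamma2 / V)."""
    V = mu0.size
    return rng.normal(mu0.mean(), np.sqrt(gamma2 / V))

def draw_gamma2(rng, mu0, psi):
    """gamma2 | psi, mu0 ~ scaled Inv-chi2(V, mean((mu0 - psi)^2))."""
    V = mu0.size
    scale = np.mean((mu0 - psi) ** 2)
    return V * scale / rng.chisquare(V)

def gibbs_corpus_rates(rng, mu0, n_iter=50):
    """Alternate the two conditional draws, starting from the sample moments."""
    psi, gamma2 = mu0.mean(), mu0.var()
    for _ in range(n_iter):
        psi = draw_psi(rng, mu0, gamma2)
        gamma2 = draw_gamma2(rng, mu0, psi)
    return psi, gamma2
```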

The discrimination parameters $\tau_{fk}^2$ independently follow an identical scaled inverse-$\chi^2$ with 
convolution parameter $\nu$ and scale parameter $\sigma^2$, while their inverse follows a 
Gamma$(\kappa_\tau = \nu/2,\ \lambda_\tau = 2/(\nu \sigma^2))$ distribution. We use HMC to sample from this unnormalized density. Specifically, 

\[
\log p(\kappa_\tau, \lambda_\tau \mid \{\tau_f^2\}_{f=1}^{V}, T) =
(\kappa_\tau - 1) \sum_{f=1}^{V} \sum_{k \in \mathcal{P}} \log (\tau_{fk}^2)^{-1}
- |\mathcal{P}| V \kappa_\tau \log \lambda_\tau - |\mathcal{P}| V \log \Gamma(\kappa_\tau)
- \frac{1}{\lambda_\tau} \sum_{f=1}^{V} \sum_{k \in \mathcal{P}} (\tau_{fk}^2)^{-1},
\]

where $\mathcal{P}(T)$ is the set of parent topics on the tree. Each draw of $(\kappa_\tau, \lambda_\tau)$ is then transformed back 
to the $(\nu, \sigma^2)$ scale. 

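The document-specific topic affinity parameters $\xi_d$ follow a multivariate normal distribution 
with mean parameter $\eta$ and a covariance matrix parameterized in terms of a scalar, $\Sigma = \lambda^2 I_K$. 
The conditional distribution of these hyperparameters is available in closed form. For efficiency, we 
choose to put a flat prior on $\log \lambda^2$ rather than the original scale, which allows us to marginalize out 
$\eta$ from the conditional posterior of $\lambda^2$: 

\[
\lambda^2 \mid \{\xi_d\}_{d=1}^{D} \sim \text{Inv-}\chi^2\!\left( DK - 1,\;
\frac{\sum_{d=1}^{D} \sum_{k=1}^{K} (\xi_{dk} - \bar{\xi}_k)^2}{DK - 1} \right)
\quad \text{and} \quad
\eta \mid \lambda^2, \{\xi_d\}_{d=1}^{D} \sim \mathcal{N}\!\left( \bar{\xi},\; \frac{\lambda^2}{D} I_K \right),
\]

where $\bar{\xi} = \frac{1}{D} \sum_{d=1}^{D} \xi_d$ is the average affinity vector. 
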


14.3.2 Estimation 

As discussed in Section 14.2.3, our estimands are the topic-specific frequency and exclusivity of 
the words in the vocabulary, as well as the Frequency-Exclusivity (FREX) score that averages each 
word’s performance in these dimensions. We use posterior means to estimate frequency and exclu- 
sivity, computing these quantities at every iteration of the Gibbs sampler and averaging the draws 
after the burn-in period. For the FREX score, we apply the ECDF function to the frequency and 
exclusivity posterior expectations of all words in the vocabulary to estimate the true ECDF. 
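
As a rough sketch of this last step, the snippet below applies an empirical CDF to posterior-mean frequency and exclusivity values for one topic and combines the two ranks. It assumes the FREX score is the weighted harmonic mean of the two ECDF values, with a weight `w` on exclusivity; that form, the default `w = 0.5`, and the function names are assumptions made for illustration and should be checked against the definition in Section 14.2.3.

```python
import numpy as np

def ecdf(values):
    """Empirical CDF evaluated at each entry: fraction of entries <= that entry."""
    values = np.asarray(values, dtype=float)
    return np.searchsorted(np.sort(values), values, side="right") / values.size

def frex(freq, excl, w=0.5):
    """Assumed FREX form: weighted harmonic mean of the exclusivity and frequency ECDF values."""
    return 1.0 / (w / ecdf(excl) + (1.0 - w) / ecdf(freq))
```

Because the harmonic mean is dominated by the smaller of the two ranks, a word must score well in both dimensions to receive a high value.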


14.3.3 Inference on Missing Document Categories 


In order to classify unlabeled documents, we need to find the posterior predictive distribution of 
the membership vector $I_{\tilde{d}}$ for a new document $\tilde{d}$. Inference is based on the new document's word 
counts $w_{\tilde{d}}$ and the unknown parameters, which we hold constant at their posterior expectation. 
Unfortunately, the posterior predictive distribution of the topic affinities is intractable without 
conditioning on the label vector, since the labels control which topics contribute content. We therefore 
use a simpler model where the topic proportions depend only on the relative size of the affinity 
parameters: 

\[
\theta^*_{dk}(\xi_d) = \frac{e^{\xi_{dk}}}{\sum_{j=1}^{K} e^{\xi_{dj}}}
\quad \text{and} \quad
I_{dk} \sim \text{Bern}\!\left( \frac{1}{1 + \exp(-\xi_{dk})} \right).
\]

The posterior predictive distribution of this simpler model factors into tractable components: 

\[
p^*(I_{\tilde{d}} \mid w_{\tilde{d}}, W, I) \approx p(I_{\tilde{d}} \mid \xi_{\tilde{d}})\, p^*(\xi_{\tilde{d}} \mid \{\mu_f\}_{f=1}^{V}, \eta, \Sigma, w_{\tilde{d}})
\propto p(I_{\tilde{d}} \mid \xi_{\tilde{d}})\, p^*(w_{\tilde{d}} \mid \xi_{\tilde{d}}, \{\mu_f\}_{f=1}^{V})\, p(\xi_{\tilde{d}} \mid \eta, \Sigma).
\]

It is then possible to find the most likely $\xi_{\tilde{d}}$ based on the evidence from $w_{\tilde{d}}$ alone. 


14.4 Empirical Analysis and Results 

We analyze the fit of the HPC model to Reuters Corpus Volume I (RCV1), a large collection of 
newswire stories. First, we demonstrate how the variance parameters $\tau_{fp}^2$ regularize the exclusivity 
with which words are expressed within topics. Second, we show that regularization of exclusivity 
has the greatest effect on infrequent words. Third, we explore the joint posterior of the topic-specific 
frequency and exclusivity of words as a summary of topical content, giving special attention to the 
upper right corner of the plot where words score highly in both dimensions. We compare words that 
score highly on the FREX metric to top words scored by frequency alone, the current practice in 
topic modeling. Finally, we compare the classification performance of HPC to baseline models. 

14.4.1 An Overview of the Reuters Corpus 

RCV1 is an archive of 806,791 newswire stories during a twelve-month period from 1996 to 1997. 3 
As described in Lewis et al. (2004), Reuters staffers assigned stories into any subset of 102 hierar- 
chical topic categories. In the original data, assignment to any topic required automatic assignment 
to all ancestor nodes, but we removed these redundant ancestor labels since they do not allow our 
model to distinguish intentional assignments to high-level categories from assignment to their off- 
spring. In our modified annotations, the only documents we see in high-level topics are those labeled 
to them and none of their children, which maps onto general content. We preprocessed document 
tokens with the Porter stemming algorithm (leading to 300,166 unique stems) and chose the most 
frequent 3% of stems (10,421 unique stems, over 100 million total tokens) for the feature set. 4 

The Reuters topic hierarchy has three levels that divide the content into finer categories at each 
cut. At the first level, content is divided between four high-level categories: three that focus on busi- 
ness and market news (Markets, Corporate/Industrial, and Economics) and one grab bag category 
that collects all remaining topics from politics to entertainment (Government/Social). The second 
level provides fine-grained divisions of these broad categories and contains the terminal nodes for 
most branches of the tree. For example, the Markets topic is split between Equity, Bond, Money, and 
Commodity markets at the second level. The third level offers further subcategories where needed 
for a small set of second-level topics. For example, the Commodity markets topic is divided between 
Agricultural (soft), Metal, and Energy commodities. We present a graphical illustration of 
the Reuters topic hierarchy in Figure 14.2. 

Many documents in the Reuters corpus are labeled to multiple topics, even after redundant an- 
cestor memberships are removed. Overall, 32% of the documents are labeled to more than one 
node of the topic hierarchy. Fifteen percent of documents have very diverse content, being labeled 
to two or more of the main branches of the tree (Markets, Corporate/Industrial, Economics, and 
Government/Social). Twenty-one percent of documents are labeled to multiple second-level categories on 
the same branch (for example, Bond markets and Equity markets in the Markets branch). Finally, 
14% of documents are labeled to multiple children of the same second-level topic (for example, 
Metals trading and Energy markets in the Commodity markets branch of Markets). Therefore, a 
completely general mixed membership model such as HPC is necessary to capture the labeling pat- 
terns of the corpus. A full breakdown of membership statistics by topic is presented in Tables 14.2 
and 14.3. 


3 Available upon request from the National Institute of Standards and Technology (NIST), 
http://trec.nist.gov/data/reuters/reuters.html. 

4 Including rarer features did not meaningfully change the results. 

FIGURE 14.2 

Topic hierarchy of Reuters corpus. 



14.4.2 The Differential Usage Parameters Regulate Topic Exclusivity 

A word can only be exclusive to a topic if its expression across the sibling topics is allowed to 
diverge from the parent rate. Therefore, we would only expect words with high differential usage 
parameters $\tau_{fp}^2$ at the parent level to be candidates for highly exclusive expression $\phi_{fk}$ in any child 
topic $k$. Words with child topic rates that cannot vary greatly from the parent should have nearly 
equal expression in each child, meaning $\phi_{fk} \approx 1/C$ for a branch with $C$ child topics. An important 
consequence is that, although the $\phi_{fk}$ are not directly modeled in HPC, their distribution is 
regularized by learning a prior distribution on the $\tau_{fp}^2$. 

This tight relation can be seen in the HPC fit. Figure 14.3 shows the joint posterior expectation 
of the differential usage parameters in a parent topic and exclusivity parameters across the child 
topics. Specifically, the left panel compares the rate variance of the children of Markets from their 
parent to exclusivity between the child topics; the right panel does the same with the two children 
of Performance, a second-level topic under the Corporate category. The plots have similar patterns. 
For low levels of differential expression, the exclusivity parameters are clustered around the 
baseline value, $1/C$. At high levels of child rate variance, words gain the ability to approach exclusive 
expression in a single topic. 
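
The relation between child rate variance and exclusivity can be reproduced in a small simulation. The sketch assumes that exclusivity is a word's rate in one child topic divided by the sum of its rates across the $C$ sibling topics, and that child log rates are drawn from a normal distribution centered on the parent log rate with variance $\tau^2$; the function names and settings are hypothetical.

```python
import numpy as np

def exclusivity(mu_children):
    """Share of each child's rate among its siblings, computed from log rates (rows = words)."""
    z = mu_children - mu_children.max(axis=-1, keepdims=True)   # stabilize the exponentials
    rates = np.exp(z)
    return rates / rates.sum(axis=-1, keepdims=True)

def simulate_max_exclusivity(rng, tau2, C=4, n_words=5000, mu_parent=0.0):
    """For each simulated word, return its largest exclusivity across C sibling topics."""
    mu = rng.normal(mu_parent, np.sqrt(tau2), size=(n_words, C))
    return exclusivity(mu).max(axis=1)
```

With a very small variance the largest exclusivity stays near the baseline $1/C$, while a large variance lets it approach one, mirroring the pattern described for Figure 14.3.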

14.4.3 Frequency Modulates the Regularization of Exclusivity 

One of the most appealing aspects of regularization in generative models is that it acts most strongly 
on the parameters for which we have the least information. In the case of the exclusivity parameters 
in HPC we have the most data for frequent words, so for a given topic the words with low rates 
should be least able to escape regularization of their exclusivity parameters by our shrinkage prior 
on the parent's $\tau_{fp}^2$. 

TABLE 14.2 

Topic membership statistics. 

Code   Topic name                           # docs   Any MM  CB L1 MM  CB L2 MM  CB L3 MM
CCAT   CORPORATE/INDUSTRIAL                   2170    79.60     79.60     13.10      0.80
C11    STRATEGY/PLANS                        24325    51.50     11.50     44.50      4.50
C12    LEGAL/JUDICIAL                        11944    99.20     98.90     50.20      1.70
C13    REGULATION/POLICY                     37410    85.90     55.60     61.40      4.50
C14    SHARE LISTINGS                         7410    30.30      7.90     10.30     15.80
C15    PERFORMANCE                             229    82.10     35.80     74.20      1.70
C151   ACCOUNTS/EARNINGS                     81891     7.90      1.30      0.60      6.40
C152   COMMENT/FORECASTS                     73092    18.90      4.80      1.60     13.50
C16    INSOLVENCY/LIQUIDITY                   1920    66.70     31.50     54.60      3.60
C17    FUNDING/CAPITAL                        4767    78.10     41.40     67.70      5.00
C171   SHARE CAPITAL                         18313    44.60      3.20      1.70     41.50
C172   BONDS/DEBT ISSUES                     11487    15.10      5.70      0.30      9.70
C173   LOANS/CREDITS                          2636    24.70      8.50      3.60     15.60
C174   CREDIT RATINGS                         5871    65.60     59.00      0.50      7.50
C18    OWNERSHIP CHANGES                        30    76.70     23.30     76.70      3.30
C181   MERGERS/ACQUISITIONS                  43374    34.40      6.50      4.80     26.90
C182   ASSET TRANSFERS                        4671    28.30      4.70      5.70     21.00
C183   PRIVATISATIONS                         7406    73.70     34.20      6.30     44.10
C21    PRODUCTION/SERVICES                   25403    76.40     46.50     53.60      0.80
C22    NEW PRODUCTS/SERVICES                  6119    55.00     15.30     49.10      0.40
C23    RESEARCH/DEVELOPMENT                   2625    77.00     36.40     57.80      0.90
C24    CAPACITY/FACILITIES                   32153    72.20     33.60     58.40      0.90
C31    MARKETS/MARKETING                     29073    46.90     25.30     34.60      1.30
C311   DOMESTIC MARKETS                       4299    80.60     73.70      9.50     18.70
C312   EXTERNAL MARKETS                       6648    78.10     70.40      9.60     14.20
C313   MARKET SHARE                           1115    39.70     10.30      5.10     27.80
C32    ADVERTISING/PROMOTION                  2084    63.80     26.90     52.50      1.40
C33    CONTRACTS/ORDERS                      14122    48.00     12.60     40.50      0.80
C331   DEFENCE CONTRACTS                      1210    68.00     65.50     13.30      3.40
C34    MONOPOLIES/COMPETITION                 4835    92.30     54.90     75.70     14.00
C41    MANAGEMENT                             1083    75.60     52.10     59.90      2.00
C411   MANAGEMENT MOVES                      10272    17.70      9.60      2.40      8.20
C42    LABOUR                                11878    99.70     99.60     46.50      1.50
ECAT   ECONOMICS                               621    90.50     90.50      9.70      1.40
E11    ECONOMIC PERFORMANCE                   8568    43.00     24.20     29.10      5.10
E12    MONETARY/ECONOMIC                     24918    81.70     75.40     17.90     13.70
E121   MONEY SUPPLY                           2182    30.50     23.10      0.70      9.20
E13    INFLATION/PRICES                        130    60.00     46.90     28.50      0.80
E131   CONSUMER PRICES                        5659    24.70     15.60      6.00     12.00
E132   WHOLESALE PRICES                        939    19.00      3.40      0.60     16.90
E14    CONSUMER FINANCE                        428    73.80     43.20     61.00      1.60
E141   PERSONAL INCOME                         376    75.00     63.80      9.60     22.30
E142   CONSUMER CREDIT                         200    46.00     30.00      3.50     18.50
E143   RETAIL SALES                           1206    27.50     19.70      2.40     10.20
E21    GOVERNMENT FINANCE                      941    86.70     81.40     53.90      4.00
E211   EXPENDITURE/REVENUE                   15768    78.20     72.40     16.10     13.80
E212   GOVERNMENT BORROWING                  27405    32.70     29.60      2.70      4.50
E31    OUTPUT/CAPACITY                         591    45.20     18.30     35.20      0.50
E311   INDUSTRIAL PRODUCTION                  1701    17.70      9.80      3.10      9.30
E312   CAPACITY UTILIZATION                     52    65.40     13.50      3.80     57.70
E313   INVENTORIES                             111    26.10     10.80      0.00     16.20
E41    EMPLOYMENT/LABOUR                     14899   100.00    100.00     49.40      2.20
E411   UNEMPLOYMENT                           2136    92.00     90.60     10.40     12.00
E51    TRADE/RESERVES                         4015    85.10     75.50     38.70      1.90
E511   BALANCE OF PAYMENTS                    2933    63.80     43.70      8.20     25.70
E512   MERCHANDISE TRADE                     12634    64.90     59.10     11.50     11.70
E513   RESERVES                               2290    30.10     22.70      1.30     16.80
E61    HOUSING STARTS                          391    51.70     47.80     13.80      0.80
E71    LEADING INDICATORS                     5270     2.90      0.60      2.40      0.20

Key: MM = Mixed membership, CB Lx = Cross-branch MM at level x; MM columns are percentages. 

TABLE 14.3 

Topic membership statistics, continued. 

Code   Topic name                           # docs   Any MM  CB L1 MM  CB L2 MM  CB L3 MM
GCAT   GOVERNMENT/SOCIAL                     24546     2.50      2.50      0.50      0.10
G15    EUROPEAN COMMUNITY                     1545    16.10      6.90     14.60      0.00
G151   EC INTERNAL MARKET                     3307    98.00     87.20     10.60     94.30
G152   EC CORPORATE POLICY                    2107    96.70     90.70     40.30     50.30
G153   EC AGRICULTURE POLICY                  2360    96.10     94.20     31.40     27.70
G154   EC MONETARY/ECONOMIC                   8404    98.20     93.00     11.50     43.90
G155   EC INSTITUTIONS                        2124    70.80     42.00     24.30     54.00
G156   EC ENVIRONMENT ISSUES                   260    75.00     57.70     28.80     50.80
G157   EC COMPETITION/SUBSIDY                 2036   100.00     99.80     60.20     32.50
G158   EC EXTERNAL RELATIONS                  4300    80.70     62.80     27.00     24.80
G159   EC GENERAL                               40    47.50     17.50     35.00      2.50
GCRIM  CRIME, LAW ENFORCEMENT                32219    79.50     41.60     59.40      0.90
GDEF   DEFENCE                                8842    93.70     17.20     84.40      0.50
GDIP   INTERNATIONAL RELATIONS               37739    73.70     20.50     60.70      0.90
GDIS   DISASTERS AND ACCIDENTS                8657    75.70     40.10     52.20      0.20
GENT   ARTS, CULTURE, ENTERTAINMENT           3801    68.80     29.20     49.60      0.50
GENV   ENVIRONMENT AND NATURAL WORLD          6261    90.20     51.50     72.30      2.50
GFAS   FASHION                                 313    76.40     45.70     41.50      1.90
GHEA   HEALTH                                 6030    81.90     56.10     65.00      1.20
GJOB   LABOUR ISSUES                         17241    99.60     99.40     44.60      3.30
GMIL   MILLENNIUM ISSUES                         5   100.00    100.00     40.00      0.00
GOBIT  OBITUARIES                              844    99.40     15.30     99.40      0.00
GODD   HUMAN INTEREST                         2802    60.70      9.70     55.20      0.10
GPOL   DOMESTIC POLITICS                     56878    79.60     29.70     63.00      1.80
GPRO   BIOGRAPHIES, PERSONALITIES, PEOPLE     5498    87.50     10.00     84.70      0.10
GREL   RELIGION                               2849    86.10      6.60     84.30      0.10
GSCI   SCIENCE AND TECHNOLOGY                 2410    55.20     22.20     45.10      0.30
GSPO   SPORTS                                35317     1.30      0.60      0.90      0.00
GTOUR  TRAVEL AND TOURISM                      680    89.60     69.70     34.70      3.40
GVIO   WAR, CIVIL WAR                        32615    67.30     10.10     64.60      0.10
GVOTE  ELECTIONS                             11532   100.00     13.30    100.00      1.30
GWEA   WEATHER                                3878    73.90     46.80     46.40      0.10
GWELF  WELFARE, SOCIAL SERVICES               1869    95.40     75.50     74.10      3.40
MCAT   MARKETS                                 894    81.10     81.10     14.50      2.20
M11    EQUITY MARKETS                        48700    16.30     12.30      3.90      2.90
M12    BOND MARKETS                          26036    21.30     15.60      5.20      3.50
M13    MONEY MARKETS                           447    65.80     51.90     23.30      1.60
M131   INTERBANK MARKETS                     28185    15.10      9.40      0.70      6.40
M132   FOREX MARKETS                         26752    36.90     24.70      3.10     16.10
M14    COMMODITY MARKETS                      4732    18.00     16.70      2.30      0.10
M141   SOFT COMMODITIES                      47708    24.10     22.80      5.50      2.00
M142   METALS TRADING                        12136    34.70     19.30      4.10     16.10
M143   ENERGY MARKETS                        21957    21.10     18.40      4.80      2.90

Key: MM = Mixed membership, CB Lx = Cross-branch MM at level x; MM columns are percentages. 


Figure 14.4 displays words in terms of their frequency (on the X axis) and exclusivity (on the 
Y axis). The two panels correspond to two different topics, namely science and technology and 
research and development, and exclusivity scores are computed for each of these topics compared to 
their sibling topics in the topic hierarchy. We will refer to this plot as the FREX plot in the following. 
The left panel features the Science and Technology topic, a child in the grab bag Government/Social 
branch; the right panel features the Research/Development topic, a child in the Corporate branch. 
The overall shape of the joint posterior is very similar for both topics. On the left side of the plots, the 
exclusivity of rare words is unable to significantly exceed the baseline. This is because the model 
does not have much evidence to estimate usage in the topic, so the estimated rate is shrunk heavily 
toward the parent rate. However, we see that it is possible for rare words to be underexpressed in 
a topic, which happens if they are frequent and overexpressed in a sibling topic. Even though their 
rates are similar to the parent in this topic, sibling topics may have a much higher rate and account 
for most appearances of the word in the comparison group. 

FIGURE 14.3 

Exclusivity as a function of differential usage parameters. Left panel: Differential-Exclusivity plot 
for MARKETS; right panel: Differential-Exclusivity plot for PERFORMANCE. 



14.4.4 A Better Two-Dimensional Summary of Semantic Content 

Words in the upper right of the FREX plot — those that are both frequent and highly exclusive — 
are of greatest interest. These are the most common words in the corpus that are also likely to 
have been generated from the topic of interest (rather than similar topics). We show words in the 
upper 5% quantiles in both dimensions for our example topics in Figure 14.5. These high-scoring 
words can help to clarify content even for labeled topics. In the Science and Technology topic, we 
see almost all terms are specific to the American and Russian space programs. Similarly, in the 
Research/Development topic, almost all terms relate to clinical trials in medicine or to agricultural 
research. 

We also compute the FREX score for each word-topic pair, a univariate summary of topical 
content that averages performance in both dimensions. In Table 14.4 we compare the top FREX 
words in three topics to a ranking based on frequency alone, which is the current practice in topic 
modeling. For context, we also show the immediate neighbors of each topic in the tree. The topic 
being examined is in bolded red, while the borders of the comparison set are solid. The Defense 
Contracts topic is a special case since it is an only child. In these cases, we use a comparison to the 
parent topic to calculate exclusivity. 

By incorporating exclusivity information, FREX-ranked lists include fewer words that are used 
similarly everywhere (such as said and would) and fewer words that are used similarly in a set of 
related topics (such as price and market in the Markets branch). One can understand this result by 
comparing the rankings for known stopwords from the SMART list to other words. In Figure 14.6, 
we show the maximum ECDF ranking for each word across topics in the distribution of frequency 
(left panel) and exclusivity (right panel) estimates. One can see that while stopwords are more likely 
to be in the extreme quantiles of frequency, very few of them are among the most exclusive words. 
This prevents general and context-specific stopwords from ranking highly in a FREX-based index. 

FIGURE 14.4 

Frequency-Exclusivity (FREX) plots. 

FIGURE 14.5 

Upper right corner of FREX plot. Panel title: Upper 5% of FREX plot for SCIENCE AND TECHNOLOGY. 

FIGURE 14.6 

Comparison of FREX score components for SMART stopwords vs. regular words. Left panel: Density of 
maximum frequency ECDF over all topics; right panel: Density of maximum exclusivity ECDF over all topics. 




14.4.5 Classification Performance 

We compare the classification performance of HPC with SVM and L2-regularized logistic regres- 
sion. All methods were trained on a random sample of 15% of the documents using the 3% most 
frequent words in the corpus as features. These fits were used to predict memberships in the withheld 
documents, an experiment we repeated ten times with a new random sample as a training set. Ta- 
ble 14.5 shows the results of our experiment, using both micro averages (every document weighted 
equally) and macro averages (every topic weighted equally). While HPC does not dominate other 
methods, on average its performance does not deviate significantly from traditional classification 
algorithms. 
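
For reference, the two averaging schemes can be computed from binary document-by-topic label matrices as sketched below. The sketch assumes the usual pooled definition of micro averaging and treats a topic with an empty denominator as scoring zero; both choices, and the function names, are assumptions rather than details given in the text.

```python
import numpy as np

def _safe_div(num, den):
    """Elementwise num / den, returning 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def micro_macro(y_true, y_pred):
    """Micro and macro precision and recall for boolean arrays of shape (documents, topics)."""
    tp = np.sum(y_true & y_pred, axis=0)      # true positives per topic
    fp = np.sum(~y_true & y_pred, axis=0)     # false positives per topic
    fn = np.sum(y_true & ~y_pred, axis=0)     # false negatives per topic
    return {
        "micro_precision": float(_safe_div(tp.sum(), tp.sum() + fp.sum())),
        "micro_recall": float(_safe_div(tp.sum(), tp.sum() + fn.sum())),
        "macro_precision": float(np.mean(_safe_div(tp, tp + fp))),
        "macro_recall": float(np.mean(_safe_div(tp, tp + fn))),
    }
```

Micro averages are driven by the largest topics, whereas macro averages give small topics such as MILLENNIUM ISSUES the same weight as ACCOUNTS/EARNINGS.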

HPC is not designed for optimizing predictive accuracy out-of-sample; rather, it is designed to 
maximize interpretability of the label-specific summaries, in terms of words that are both frequent 
and exclusive. These results offer a quantitative illustration of the classical trade-off between pre- 
dictive and explanatory power of statistical models (Breiman, 2001). 


14.5 Concluding Remarks 

Our thesis is that one needs to know how words are used differentially across topics as well as 
within them in order to understand topical content; we refer to these dimensions of content as 
word exclusivity and frequency. Topical summaries that focus on word frequency alone are often 
dominated by stopwords or other terms used similarly across many topics. Exclusivity and frequency 
can be visualized graphically as a latent space or combined into an index such as the FREX score 
to obtain a univariate measure of the topical content for words in each topic. 

Naive estimates of exclusivity will be biased toward rare words due to sensitivity to small differences 
in estimated use across topics. Existing topic models such as LDA cannot regularize differential 
use due to topic normalization of usage rates; its symmetric Dirichlet prior on topic distributions 
regularizes within, not between, topic usage. While topic-regularized models can capture many 
important facets of word usage, they are not optimal for the estimands used in our analysis of 
topical content. 

TABLE 14.4 

Comparison of high FREX words (both frequent and exclusive) to most frequent words (featured 
topic name in underlined bold red; comparison set in solid ovals). 

TABLE 14.5 

Classification performance for ten-fold cross-validation. 

                      SVM             L2-reg Logit    HPC
Micro-ave Precision   0.711 (0.002)   0.195 (0.031)   0.695 (0.007)
Micro-ave Recall      0.706 (0.001)   0.768 (0.013)   0.589 (0.008)
Macro-ave Precision   0.563 (0.002)   0.481 (0.025)   0.505 (0.094)
Macro-ave Recall      0.551 (0.006)   0.600 (0.007)   0.524 (0.093)

Standard deviation of performance over ten folds in parentheses. 

HPC breaks from standard topic models by modeling topic-specific word counts as unnormal- 
ized count variates whose rates can be regularized both within and across topics to compute word 
frequency and exclusivity. It was specifically designed to produce stable exclusivity estimates in 
human-annotated corpora by smoothing differential word usage according to a semantically intelli- 
gent distance metric: proximity on a known hierarchy. This supervised setting is an ideal test case 
for our framework and will be applicable to many high value corpora such as the ACM library, IMS 
publications, the New York Times and Reuters, which all have professional editors and authors and 
provide multiple annotations to a hierarchy of labels for each document. 

HPC offers a complex challenge for full Bayesian inference. To offer a flexible framework for 
regularization, it breaks from the simple Dirichlet-multinomial conjugacy of traditional models. 
Specifically, HPC uses Poisson likelihoods whose rates are smoothed across a known topic hierarchy 
with a Gaussian diffusion and a novel mixed membership model where document label and topic 
membership parameters share a Gaussian prior. The membership model is the first to create an 
explicit link between the distribution of topic labels in a document and of the words that appear 
in a document and allow for multiple labels. However, the resulting inference is challenging since, 
conditional on word usage rates, the posterior of the membership parameters involves Poisson and 
Bernoulli likelihoods of differing dimensions constrained by a Gaussian prior. 

We offer two methodological innovations to make inference tractable. First, we design our model 
with parameters that divide cleanly into two blocks (the tree and document parameters) whose mem- 
bers are conditionally independent given the other block, allowing for parallelized, scalable infer- 
ence. However, these factorized distributions cannot be normalized analytically and are the same 
dimension as the number of topics (102 in the case of Reuters). We therefore implement a Hamil- 
tonian Monte Carlo conditional sampler that mixes efficiently through high-dimensional spaces by 
leveraging the posterior gradient and Hessian information. This allows HPC to scale to large and 
complex topic hierarchies that would be intractable for random walk Metropolis samplers. One un- 
resolved bottleneck in our inference strategy is that the Markov chain Monte Carlo sampler mixes 
slowly through the hyperparameter space of the documents, namely the $\eta$ and $\lambda^2$ parameters that control 
the mean and sparsity of topic memberships and labels. This is due to a large fraction of missing in- 
formation in our augmentation strategy (Meng and Rubin, 1991). Conditional on all the documents’ 
topic affinity parameters $\{\xi_d\}_{d=1}^{D}$, these hyperparameters index a normal distribution with $D$ 
observations; marginally, however, we have much less information about the exact loading of each 
topic onto each document. While we have been exploring more efficient data augmentation strategies 
such as parameter expansion (Liu and Wu, 1999), we have not found a workable alternative to 
augmenting the posterior with the entire set of $\{\xi_d\}_{d=1}^{D}$ parameters. 

14.5.1 Toward Automated Evaluation of Topic Models 

While HPC was developed for the specific case of hierarchically labeled document collections, this 
framework can be readily extended to other types of document corpora. For labeled corpora where 
no hierarchical structure on the topics is available, one can use a flat hierarchy to model differential 
use. For document corpora where no labeled examples are available, a simple word rate model with 
a flat hierarchy and dense topic membership structure could be employed to get more informative 
summaries of inferred topics. In either case, the word rate framework could be combined with 
nonparametric Bayesian models that infer hierarchical structure on the topics (Adams et al., 2010). 
We expect modeling approaches based on rates will play an important role in future work on text 
summarization. 

The HPC model can also be leveraged to semi-automate the construction of topic ontologies tar- 
geted to specific domains, for instance, when fit to comprehensive human-annotated corpora such 
as Wikipedia, the New York Times, Encyclopedia Britannica, or databases such as JSTOR and the 
ACM repository. By learning a probabilistic representation of high quality topics, HPC output can 
be used as a gold standard to aid and evaluate other learning methods. Targeted ontologies have 
been a key factor in monitoring scientific progress in biology (Ashburner et al., 2000; Kanehisa and 
Goto, 2000). A hierarchical ontology of topics would lead to new metrics for measuring progress 
in text analysis. It would enable an evaluation of the semantic content of any collection of inferred 
topics, thus finally allowing for a quantitative comparison among the output of topic models. Cur- 
rent evaluations are qualitative, anecdotal, and unsatisfactory; for instance, authors argue that lists 
of most frequent words describing an arbitrary selection of topics inferred by a new model make 
sense intuitively, or that they are better than lists obtained with other models. 

In addition to model evaluation, a news-specific ontology could be used as a prior to inform 
the analysis of unstructured text, including Twitter feeds, Facebook wall posts, and blogs. Unsu- 
pervised topic models infer a latent topic space that may be oriented around unhelpful axes, such 
as authorship or geography. Using a human-created ontology as a prior could ensure that a useful 
topic space is discovered without being so dogmatic as to assume that unlabeled documents have 
the same latent structure as labeled examples. 


Appendix: Implementing the Parallelized HMC Sampler 
Hamiltonian Monte Carlo Conditional Updates 

Hamiltonian Monte Carlo (HMC) is the key tool that makes high-dimensional, non-conjugate up- 
dates tractable for our Gibbs sampler. It works well for log densities that are unimodal and have 
relatively constant curvature. We outline our customized implementation of the algorithm here; a 
general introduction can be found in Neal (2011). 

HMC is a version of the Metropolis-Hastings algorithm that replaces the common Multivariate 
Normal proposal distribution with a distribution based on Hamiltonian dynamics. It can be used to 
make joint proposals on the entire parameter space or, as in this paper, to make proposals along 
the conditional posteriors as part of a Gibbs scan. While it requires closed form calculation of 
the posterior gradient and curvature to perform well, the algorithm can produce uncorrelated or 
negatively correlated draws from the target distribution that are almost always accepted. 

A consequence of classical mechanics, Hamilton's equations can be used to model the movement
of a particle along a frictionless surface. The total energy of the particle is the sum of its
potential energy (the height of the surface relative to the minimum at the current position) and its
kinetic energy (the amount of work needed to accelerate the particle from rest to its current velocity).
Since energy is preserved in a closed system, the particle can only convert potential energy to
kinetic energy (or vice versa) as it moves along the surface.

Imagine a ball placed high on the side of the parabola $f(q) = q^2$ at position $q = -2$. Starting out,
it will have no kinetic energy but significant potential energy due to its position. As it rolls down the 
parabola toward zero, it speeds up (gaining kinetic energy), but loses potential energy to compensate 
as it moves to a lower position. At the bottom of the parabola the ball has only kinetic energy, which 
it then translates back into potential energy by rolling up the other side until its kinetic energy is 
exhausted. It will then roll back down the side it just climbed, completely reversing its trajectory 
until it returns to its original position. 

HMC uses Hamiltonian dynamics as a method to find a distant point in the parameter space
with high probability of acceptance. Suppose we want to produce samples from $f(q)$, a possibly
unnormalized density. Since we want high probability regions to have the least potential energy,
we parameterize the surface the particle moves along as $U(q) = -\log f(q)$, which is the height
of the surface and the potential energy of the particle at any position $q$. The total energy of the
particle, $H(p, q)$, is the sum of its kinetic energy, $K(p)$, and its potential energy, $U(q)$, where $p$ is
its momentum along each coordinate. After drawing an initial momentum for the particle (typically
chosen as $p \sim \mathcal{N}(0, M)$, where $M$ is called the mass matrix), we allow the system to evolve for
a period of time: not so little that there is negligible absolute movement, but not so much that
the particle has time to roll back to where it started.
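
The steps just described can be sketched in a few lines for a scalar parameter. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the target $f(q) \propto \exp(-q^2)$, the step size, the number of leapfrog steps, and the function names `leapfrog` and `hmc_step` are all chosen for the example.

```python
import math
import random


def leapfrog(q, p, grad_U, eps, n_steps, mass=1.0):
    """Simulate Hamiltonian dynamics for a scalar position q and momentum p."""
    p = p - 0.5 * eps * grad_U(q)            # initial half step for momentum
    for step in range(n_steps):
        q = q + eps * p / mass               # full step for position
        if step < n_steps - 1:
            p = p - eps * grad_U(q)          # full step for momentum
    p = p - 0.5 * eps * grad_U(q)            # final half step for momentum
    return q, p


def hmc_step(q, U, grad_U, eps, n_steps, rng, mass=1.0):
    """One HMC transition: draw a momentum, integrate, then accept or reject."""
    p0 = rng.gauss(0.0, math.sqrt(mass))     # p ~ N(0, M)
    q_new, p_new = leapfrog(q, p0, grad_U, eps, n_steps, mass)
    h_old = U(q) + 0.5 * p0 * p0 / mass      # total energy at the start
    h_new = U(q_new) + 0.5 * p_new * p_new / mass
    if math.log(1.0 - rng.random()) < h_old - h_new:
        return q_new, True
    return q, False


# Illustrative target (an assumption): f(q) proportional to exp(-q^2), so U(q) = q^2.
U = lambda q: q * q
grad_U = lambda q: 2.0 * q

rng = random.Random(0)
q, draws, accepted = -2.0, [], 0
for _ in range(2000):
    # mass = 2 is the negative second derivative of log f, as the text suggests
    q, ok = hmc_step(q, U, grad_U, eps=0.2, n_steps=8, rng=rng, mass=2.0)
    draws.append(q)
    accepted += ok
```

With the mass set to the curvature, a trajectory of total length `eps * n_steps = 1.6` is close to a quarter of the oscillation period, which is long enough to move far from the starting point without rolling back.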

HMC will not generate good proposals if the particle is not given enough momentum in each 
direction to efficiently explore the parameter space in a fixed window of time. The higher the cur- 
vature of the surface, the more energy the particle needs to move to a distant point. Therefore the 
performance of the algorithm depends on having a good estimate of the posterior curvature $H(q)$
(the Hessian of the log posterior at $q$) and drawing $p \sim \mathcal{N}(0, -H(q))$. If the estimated curvature is accurate and relatively constant
across the parameter space, the particle will have high initial momentum along directions where the 
posterior is concentrated and less along those where the posterior is more diffuse. 

Unless the (conditional) posterior is very well-behaved, the Hessian should be calculated at the 
log-posterior mode to ensure positive definiteness. Maximization is generally an expensive opera- 
tion, however, so it is not feasible to update the Hessian every iteration of the sampler. In contrast, 
the log-prior curvature is very easy to calculate and well-behaved everywhere. This led us to de- 
velop the scheduled conditional HMC sampler (SCHMC), an algorithm for nonconjugate Gibbs 
draws that updates the log-prior curvature at every iteration but only updates the log-likelihood cur- 
vature in a strategically chosen subset of iterations. We use this algorithm for all non-conjugate 
conditional draws in our Gibbs sampler. 

Specifically, suppose we want to draw from the conditional distribution
$p(\theta \mid \psi_t, y) \propto p(y \mid \theta, \psi_t)\, p(\theta \mid \psi_t)$ in each Gibbs scan, where $\psi_t$ is a vector of the
remaining parameters and $y$ is the observed data. Let $S$ be the set of full Gibbs scans in which the
log-likelihood Hessian information is updated (which always includes the first). For Gibbs scan $i \in S$,
we first calculate the conditional posterior mode and evaluate both the Hessian of the log-likelihood,
$\log p(y \mid \theta, \psi_t)$, and of the log-prior, $\log p(\theta \mid \psi_t)$, at that mode, adding them together to get the
log-posterior Hessian. We then get a conditional posterior draw with HMC using the negative Hessian
as our mass matrix. For Gibbs scan $i \notin S$, we evaluate the log-prior Hessian at the current location
and add it to our last evaluation of the log-likelihood Hessian to get the log-posterior Hessian. We
then proceed as before. The SCHMC procedure is described in step-by-step detail in Algorithm 1.

SCHMC Implementation Details for the HPC Model 

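In the previous section we described our general procedure for obtaining samples from unnormalized
conditional posteriors, the SCHMC algorithm. In this section, we provide the gradient and
Hessian calculations necessary to implement this procedure for the unnormalized conditional densities
in the HPC model, as well as strategies to obtain the maximum of each conditional posterior.

Algorithm 1: Scheduled conditional HMC sampler for iteration i.

input : $\theta_{t-1}$, $\psi_t$ (current value of other parameters), $y$ (observed data), $L$ (number of leapfrog steps), $\epsilon$ (stepsize), and
        $S$ (set of full Gibbs scans in which the likelihood Hessian is updated)
output: $\theta_t$

$\theta^*_0 \leftarrow \theta_{t-1}$
/* Update conditional likelihood Hessian if iteration in schedule */
if $i \in S$ then
    $\hat{\theta} \leftarrow \arg\max_{\theta} \{ \log p(y \mid \theta, \psi_t) + \log p(\theta \mid \psi_t) \}$
    $H_l(\hat{\theta}) \leftarrow \frac{\partial^2}{\partial \theta \partial \theta^T} [\log p(y \mid \theta, \psi_t)] \big|_{\theta = \hat{\theta}}$
end
/* Calculate prior Hessian and set up mass matrix */
$H_p(\hat{\theta}) \leftarrow \frac{\partial^2}{\partial \theta \partial \theta^T} [\log p(\theta \mid \psi_t)] \big|_{\theta = \hat{\theta}}$
$H(\hat{\theta}) \leftarrow H_l(\hat{\theta}) + H_p(\hat{\theta})$
$M \leftarrow -H(\hat{\theta})$
/* Draw initial momentum */
Draw $p^*_0 \sim \mathcal{N}(0, M)$
/* Leapfrog steps to get HMC proposal */
for $l \leftarrow 1$ to $L$ do
    $g_1 \leftarrow -\frac{\partial}{\partial \theta} [\log p(\theta \mid \psi_t, y)] \big|_{\theta = \theta^*_{l-1}}$
    $p^*_{l,1} \leftarrow p^*_{l-1} - \frac{\epsilon}{2} g_1$
    $\theta^*_l \leftarrow \theta^*_{l-1} + \epsilon (M^{-1})^T p^*_{l,1}$
    $g_2 \leftarrow -\frac{\partial}{\partial \theta} [\log p(\theta \mid \psi_t, y)] \big|_{\theta = \theta^*_l}$
    $p^*_l \leftarrow p^*_{l,1} - \frac{\epsilon}{2} g_2$
end
/* Calculate Hamiltonian (total energy) of initial position */
$K_{t-1} \leftarrow \frac{1}{2} (p^*_0)^T M^{-1} p^*_0$
$U_{t-1} \leftarrow -\log p(\theta^*_0 \mid \psi_t, y)$
$H_{t-1} \leftarrow K_{t-1} + U_{t-1}$
/* Calculate Hamiltonian (total energy) of candidate position */
$K^* \leftarrow \frac{1}{2} (p^*_L)^T M^{-1} p^*_L$
$U^* \leftarrow -\log p(\theta^*_L \mid \psi_t, y)$
$H^* \leftarrow K^* + U^*$
/* Metropolis correction to determine if proposal accepted */
Draw $u \sim \mathrm{Unif}[0, 1]$
$\log r \leftarrow H_{t-1} - H^*$
if $\log u < \log r$ then
    $\theta_t \leftarrow \theta^*_L$
else
    $\theta_t \leftarrow \theta_{t-1}$
end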
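
To make the scheduling in Algorithm 1 concrete, the sketch below applies it to a scalar toy problem. Everything specific here is an assumption made for illustration: the model (Poisson counts with log-rate $\theta$ and a $\mathcal{N}(m, s^2)$ prior), the schedule `{0, 10, 100}`, the step size, and the function names. It is not the HPC sampler itself.

```python
import math
import random


def log_post(theta, y, m, s2):
    """Unnormalized log posterior: Poisson counts with log-rate theta, N(m, s2) prior."""
    return sum(yi * theta - math.exp(theta) for yi in y) - 0.5 * (theta - m) ** 2 / s2


def grad_log_post(theta, y, m, s2):
    return sum(y) - len(y) * math.exp(theta) - (theta - m) / s2


def find_mode(theta, y, m, s2, n_iter=50):
    """Newton iterations on the concave log posterior, started at the current value."""
    for _ in range(n_iter):
        hess = -len(y) * math.exp(theta) - 1.0 / s2
        theta -= grad_log_post(theta, y, m, s2) / hess
    return theta


def schmc_draw(theta, i, schedule, cache, y, m, s2, eps, n_leapfrog, rng):
    """One scheduled conditional HMC draw for Gibbs scan i (scalar case)."""
    if i in schedule:
        # Expensive step: re-maximize and refresh the cached likelihood Hessian.
        cache["mode"] = find_mode(theta, y, m, s2)
        cache["h_lik"] = -len(y) * math.exp(cache["mode"])
    h_prior = -1.0 / s2                       # cheap; recomputed at every scan
    mass = -(cache["h_lik"] + h_prior)        # M = -H
    p0 = rng.gauss(0.0, math.sqrt(mass))
    q, p = theta, p0
    for _ in range(n_leapfrog):
        p_half = p + 0.5 * eps * grad_log_post(q, y, m, s2)
        q = q + eps * p_half / mass
        p = p_half + 0.5 * eps * grad_log_post(q, y, m, s2)
    h_old = -log_post(theta, y, m, s2) + 0.5 * p0 * p0 / mass
    h_new = -log_post(q, y, m, s2) + 0.5 * p * p / mass
    if math.log(1.0 - rng.random()) < h_old - h_new:
        return q, True
    return theta, False


rng = random.Random(1)
y = [3, 5, 4, 6, 2, 4]
schedule = {0, 10, 100}        # hypothetical schedule; it always includes the first scan
cache, theta, n_acc, draws = {}, 1.0, 0, []
for i in range(500):
    # In a real Gibbs scan the other parameters (here m and s2) would change between scans.
    theta, ok = schmc_draw(theta, i, schedule, cache, y, 1.0, 4.0, 0.5, 3, rng)
    draws.append(theta)
    n_acc += ok
```

The design point is the `cache`: the maximization and the likelihood Hessian are only recomputed on scheduled scans, while the prior curvature is refreshed every time.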

Conditional Posterior of the Rate Parameters 

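The log conditional posterior of the rate parameters for one word is:

\[
\begin{aligned}
\log p(\mu_f \mid W, I, l, \tau_f^2, \psi, \gamma^2, \nu, \sigma^2, \{\xi_d\}_{d=1}^{D}, T)
&= \sum_{d=1}^{D} \log \mathrm{Pois}(w_{fd} \mid l_d \theta_d^T \beta_f) + \log \mathcal{N}(\mu_f \mid \psi \mathbf{1}, \Lambda(\gamma^2, \tau_f^2, T)) \\
&= -\sum_{d=1}^{D} l_d \theta_d^T \beta_f + \sum_{d=1}^{D} w_{fd} \log(\theta_d^T \beta_f) - \frac{1}{2} (\mu_f - \psi \mathbf{1})^T \Lambda (\mu_f - \psi \mathbf{1}).
\end{aligned}
\]

Since the likelihood is a function of $\beta_f$, we need to use the chain rule to get the gradient in $\mu_f$
space. Because $\beta_f = \exp(\mu_f)$ entrywise, $\partial \beta_f / \partial \mu_f = \operatorname{diag}(\beta_f)$, so that

\[
\begin{aligned}
\frac{\partial}{\partial \mu_f} \log p(\mu_f \mid W, I, l, \{\tau_f^2\}_{f=1}^{V}, \psi, \gamma^2, T)
&= \frac{\partial l(\beta_f)}{\partial \beta_f} \frac{\partial \beta_f}{\partial \mu_f} + \frac{\partial}{\partial \mu_f} \log p(\mu_f \mid \{\tau_f^2\}_{f=1}^{V}, \psi, \gamma^2, T) \\
&= -\sum_{d=1}^{D} l_d (\theta_d \circ \beta_f) + \sum_{d=1}^{D} \left( \frac{w_{fd}}{\theta_d^T \beta_f} \right) (\theta_d \circ \beta_f) - \Lambda (\mu_f - \psi \mathbf{1}),
\end{aligned}
\]

where $\circ$ is the Hadamard (entrywise) product. The Hessian matrix follows a similar pattern:

\[
H(\log p(\mu_f \mid W, I, l, \{\tau_f^2\}_{f=1}^{V}, \psi, \gamma^2, T)) = -\Theta^T W \Theta \circ \beta_f \beta_f^T + G - \Lambda,
\]

where

\[
W = \operatorname{diag}\left( \left\{ \frac{w_{fd}}{(\theta_d^T \beta_f)^2} \right\}_{d=1}^{D} \right)
\quad \text{and} \quad
G = \operatorname{diag}\left( \frac{\partial l(\beta_f)}{\partial \beta_f} \circ \beta_f \right).
\]
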

We use the BFGS algorithm with the analytical gradient derived above to maximize this density
for iterations where the likelihood Hessian is updated; this quasi-Newton method works well since
the conditional posterior is unimodal. The Hessian of the likelihood in $\beta$ space is clearly negative
definite everywhere since $\Theta^T W \Theta$ is a positive definite matrix. The matrix $\Lambda$ in the prior Hessian is also positive
definite by definition since it is the precision matrix of a Gaussian variate. However, the contribution
of the chain rule term $G$ can cause the Hessian to become indefinite away from the mode in $\mu$ space
if any of the gradient entries are sufficiently large and positive. Note, however, that the conditional
posterior is still unimodal since the logarithm is a monotone transformation.
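
The gradient and Hessian above are easy to get wrong by a sign or a chain-rule factor, so a finite-difference check is worthwhile. The sketch below is illustrative only: it uses plain Python lists, and the toy values of `theta`, `l`, `w`, `psi`, and `Lam` are assumptions invented for the example. It takes $\beta_f = \exp(\mu_f)$ entrywise and treats $\Lambda$ as a precision matrix.

```python
import math


def _rate(th, beta):
    return sum(t * b for t, b in zip(th, beta))       # theta_d^T beta_f


def log_post_mu(mu, w, l, theta, psi, Lam):
    beta = [math.exp(m) for m in mu]
    K = len(mu)
    total = 0.0
    for wd, ld, th in zip(w, l, theta):
        r = _rate(th, beta)
        total += -ld * r + wd * math.log(r)
    dev = [m - psi for m in mu]
    quad = sum(dev[j] * Lam[j][k] * dev[k] for j in range(K) for k in range(K))
    return total - 0.5 * quad


def grad_log_post_mu(mu, w, l, theta, psi, Lam):
    beta = [math.exp(m) for m in mu]
    K = len(mu)
    grad = [0.0] * K
    for wd, ld, th in zip(w, l, theta):
        r = _rate(th, beta)
        for k in range(K):
            grad[k] += (wd / r - ld) * th[k] * beta[k]  # likelihood term, chain rule applied
    dev = [m - psi for m in mu]
    for k in range(K):
        grad[k] -= sum(Lam[k][j] * dev[j] for j in range(K))
    return grad


def hess_log_post_mu(mu, w, l, theta, psi, Lam):
    beta = [math.exp(m) for m in mu]
    K = len(mu)
    H = [[-Lam[j][k] for k in range(K)] for j in range(K)]
    for wd, ld, th in zip(w, l, theta):
        r = _rate(th, beta)
        for j in range(K):
            H[j][j] += (wd / r - ld) * th[j] * beta[j]  # diagonal chain-rule term G
            for k in range(K):
                H[j][k] -= wd / (r * r) * th[j] * beta[j] * th[k] * beta[k]
    return H


# Toy inputs (assumed values): three documents, two topics, one word.
theta = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
l = [100, 80, 120]
w = [4, 1, 5]
psi = -3.5
Lam = [[2.0, 0.5], [0.5, 1.0]]
mu0 = [-3.0, -4.0]
g0 = grad_log_post_mu(mu0, w, l, theta, psi, Lam)
H0 = hess_log_post_mu(mu0, w, l, theta, psi, Lam)
```

In practice the same check can be run once before handing the gradient to a quasi-Newton optimizer such as BFGS.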


Conditional Posterior of the Topic Affinity Parameters 

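The log conditional posterior for the topic affinity parameters for one document is:

\[
\begin{aligned}
\log p(\xi_d \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma)
&= \sum_{f=1}^{V} \log \mathrm{Pois}(w_{fd} \mid l_d \beta_f^T \theta_d) + \log \mathrm{Bernoulli}(I_d \mid \xi_d) + \log \mathcal{N}(\xi_d \mid \eta, \Sigma) \\
&= -l_d \sum_{f=1}^{V} \beta_f^T \theta_d + \sum_{f=1}^{V} w_{fd} \log(\beta_f^T \theta_d) - \sum_{k=1}^{K} \log(1 + \exp(-\xi_{dk})) \\
&\quad - \sum_{k=1}^{K} (1 - I_{dk}) \xi_{dk} - \frac{1}{2} (\xi_d - \eta)^T \Sigma^{-1} (\xi_d - \eta).
\end{aligned}
\]
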

Since the likelihood of the word counts is a function of $\theta_d$, we need to use the chain rule to get
the gradient of the likelihood in $\xi_d$ space. This mapping is more complicated than in the case of the
$\beta_f$ parameters since each $\theta_{dk}$ is a function of all elements of $\xi_d$:

\[
\nabla l_d(\xi_d) = \nabla l_d(\theta_d)^T J(\theta_d \to \xi_d),
\]

where $J(\theta_d \to \xi_d)$ is the Jacobian of the transformation from $\theta$ space to $\xi$ space, a $K \times K$
symmetric matrix. Let $S = \sum_{l=1}^{K} \exp \xi_{dl}$. Then

\[
J(\theta_d \to \xi_d) = S^{-2}
\begin{pmatrix}
S \exp \xi_{d1} - \exp 2\xi_{d1} & \cdots & -\exp(\xi_{dK} + \xi_{d1}) \\
-\exp(\xi_{d1} + \xi_{d2}) & \cdots & -\exp(\xi_{dK} + \xi_{d2}) \\
\vdots & \ddots & \vdots \\
-\exp(\xi_{d1} + \xi_{dK}) & \cdots & S \exp \xi_{dK} - \exp 2\xi_{dK}
\end{pmatrix}.
\]

The gradient of the likelihood of the word counts in terms of $\theta_d$ is

\[
\nabla l_d(\theta_d) = -l_d \sum_{f=1}^{V} \beta_f + \sum_{f=1}^{V} \frac{w_{fd}}{\beta_f^T \theta_d} \beta_f.
\]

Finally, to get the gradient of the full conditional posterior, we add the gradient of the likelihood of
the labels and of the normal prior on the $\xi_d$:

\[
\frac{\partial}{\partial \xi_d} \log p(\xi_d \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma)
= \nabla l_d(\theta_d)^T J(\theta_d \to \xi_d) + (1 + \exp \xi_d)^{-1} - (1 - I_d) - \Sigma^{-1} (\xi_d - \eta).
\]

The Hessian matrix of the conditional posterior is a complicated tensor product that is not effi- 
cient to evaluate analytically. Instead, we compute a numerical Hessian using the analytic gradient 
presented above at minimal computational cost. 

We use the BFGS algorithm with the analytical gradient derived above to maximize this density 
for iterations where the likelihood Hessian is updated. We have not been able to show analytically 
that this conditional posterior is unimodal, but we have verified this graphically for several docu- 
ments and have achieved very high acceptance rates for our HMC proposals based on this Hessian 
calculation. 
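
As a hedged illustration of the chain rule through the softmax map, the sketch below codes the log conditional posterior and its analytic gradient for one document and leaves room to compare them by finite differences. The toy values (`xi0`, `w`, `beta`, `l_d`, `labels`, `eta`, `sigma_inv`) and the function names are assumptions made for the example; the Jacobian is written in the equivalent form $J_{jk} = \theta_{dj}(\delta_{jk} - \theta_{dk})$.

```python
import math


def softmax(xi):
    m = max(xi)
    e = [math.exp(x - m) for x in xi]
    s = sum(e)
    return [v / s for v in e]


def log_post_xi(xi, w, beta, l_d, labels, eta, sigma_inv):
    theta = softmax(xi)
    K = len(xi)
    total = 0.0
    for w_f, beta_f in zip(w, beta):                   # beta[f] is the K-vector beta_f
        rate = sum(b * t for b, t in zip(beta_f, theta))
        total += -l_d * rate + w_f * math.log(rate)
    for x, ind in zip(xi, labels):                     # Bernoulli label likelihood
        total += -math.log1p(math.exp(-x)) - (1 - ind) * x
    dev = [x - e for x, e in zip(xi, eta)]
    quad = sum(dev[j] * sigma_inv[j][k] * dev[k] for j in range(K) for k in range(K))
    return total - 0.5 * quad


def grad_log_post_xi(xi, w, beta, l_d, labels, eta, sigma_inv):
    theta = softmax(xi)
    K = len(xi)
    g_theta = [0.0] * K                                # gradient of word likelihood in theta space
    for w_f, beta_f in zip(w, beta):
        rate = sum(b * t for b, t in zip(beta_f, theta))
        for k in range(K):
            g_theta[k] += (w_f / rate - l_d) * beta_f[k]
    dev = [x - e for x, e in zip(xi, eta)]
    grad = []
    for k in range(K):
        # Softmax Jacobian entry J[j][k] = theta_j * (delta_jk - theta_k)
        chain = sum(g_theta[j] * theta[j] * ((j == k) - theta[k]) for j in range(K))
        label = 1.0 / (1.0 + math.exp(xi[k])) - (1 - labels[k])
        prior = sum(sigma_inv[k][j] * dev[j] for j in range(K))
        grad.append(chain + label - prior)
    return grad


# Toy inputs (assumed values): four words, three topics, one document.
xi0 = [0.2, -0.5, 1.0]
w = [2, 0, 5, 1]
beta = [[0.02, 0.05, 0.01], [0.03, 0.01, 0.02], [0.08, 0.02, 0.06], [0.01, 0.04, 0.02]]
l_d = 50
labels = [1, 0, 1]
eta = [0.0, 0.0, 0.0]
sigma_inv = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
g_xi = grad_log_post_xi(xi0, w, beta, l_d, labels, eta, sigma_inv)
```

A numerical Hessian can then be obtained by differencing `grad_log_post_xi` along each coordinate, which mirrors the approach described in the text.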

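Conditional Posterior of the $\tau^2$ Hyperparameters

The variance parameters $\tau_{fk}^2$ independently follow an identical scaled inverse-$\chi^2$ with convolution
parameter $\nu$ and scale parameter $\sigma^2$, while their inverse follows a
$\mathrm{Gamma}(\kappa_\tau = \frac{\nu}{2}, \lambda_\tau = \frac{2}{\nu \sigma^2})$ distribution. The log conditional posterior of these parameters is:

\[
\log p(\kappa_\tau, \lambda_\tau \mid \{\tau_f^2\}_{f=1}^{V}, T)
= (\kappa_\tau - 1) \sum_{f=1}^{V} \sum_{k \in \mathcal{P}} \log (\tau_{fk}^2)^{-1}
- |\mathcal{P}| V \kappa_\tau \log \lambda_\tau - |\mathcal{P}| V \log \Gamma(\kappa_\tau)
- \frac{1}{\lambda_\tau} \sum_{f=1}^{V} \sum_{k \in \mathcal{P}} (\tau_{fk}^2)^{-1},
\]

where $\mathcal{P}(T)$ is the set of parent topics on the tree. If we allow $i \in \{1, \ldots, N = |\mathcal{P}| V\}$ to index all
the $f, k$ pairs and $l(\kappa_\tau, \lambda_\tau) = \log p(\{\tau_f^2\}_{f=1}^{V} \mid \kappa_\tau, \lambda_\tau, T)$, we can simplify this to

\[
l(\kappa_\tau, \lambda_\tau) = (\kappa_\tau - 1) \sum_{i=1}^{N} \log \tau_i^{-2} - N \kappa_\tau \log \lambda_\tau - N \log \Gamma(\kappa_\tau) - \frac{1}{\lambda_\tau} \sum_{i=1}^{N} \tau_i^{-2}.
\]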

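We then transform this density onto the $(\log \kappa_\tau, \log \lambda_\tau)$ scale so that the parameters are
unconstrained, a requirement for standard HMC implementation. Each draw of $(\log \kappa_\tau, \log \lambda_\tau)$ is then
transformed back to the $(\nu, \sigma^2)$ scale. To get the Hessian of the likelihood in log space, we calculate
the derivatives of the likelihood in the original space and apply the chain rule:

\[
H(l(\log \kappa_\tau, \log \lambda_\tau)) =
\begin{pmatrix}
\kappa_\tau \dfrac{\partial l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau} + (\kappa_\tau)^2 \dfrac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial (\kappa_\tau)^2}
& \kappa_\tau \lambda_\tau \dfrac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau \partial \lambda_\tau} \\
\kappa_\tau \lambda_\tau \dfrac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau \partial \lambda_\tau}
& \lambda_\tau \dfrac{\partial l(\kappa_\tau, \lambda_\tau)}{\partial \lambda_\tau} + (\lambda_\tau)^2 \dfrac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial (\lambda_\tau)^2}
\end{pmatrix},
\]

where

\[
\nabla l(\kappa_\tau, \lambda_\tau) =
\begin{pmatrix}
\sum_{i=1}^{N} \log \tau_i^{-2} - N \log \lambda_\tau - N \psi(\kappa_\tau) \\
-\dfrac{N \kappa_\tau}{\lambda_\tau} + \dfrac{1}{(\lambda_\tau)^2} \sum_{i=1}^{N} \tau_i^{-2}
\end{pmatrix}
\]

and

\[
H(l(\kappa_\tau, \lambda_\tau)) =
\begin{pmatrix}
-N \psi'(\kappa_\tau) & -\dfrac{N}{\lambda_\tau} \\
-\dfrac{N}{\lambda_\tau} & \dfrac{N \kappa_\tau}{(\lambda_\tau)^2} - \dfrac{2}{(\lambda_\tau)^3} \sum_{i=1}^{N} \tau_i^{-2}
\end{pmatrix}.
\]
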

Following Algorithm 1, we evaluate the Hessian at the mode of this joint posterior. This is
easiest to find on the original scale following the properties of the Gamma distribution. The first order
condition for $\lambda_\tau$ can be solved analytically:

\[
\lambda_{\tau,\mathrm{MLE}}(\kappa_\tau) = \arg\max_{\lambda_\tau} \{ l(\kappa_\tau, \lambda_\tau) \} = \frac{1}{\kappa_\tau N} \sum_{i=1}^{N} \tau_i^{-2}.
\]

We can then numerically maximize the profile likelihood of $\kappa_\tau$:

\[
\kappa_{\tau,\mathrm{MLE}} = \arg\max_{\kappa_\tau} \{ l(\kappa_\tau, \lambda_{\tau,\mathrm{MLE}}(\kappa_\tau)) \}.
\]

The joint mode in the original space is then $(\kappa_{\tau,\mathrm{MLE}}, \lambda_{\tau,\mathrm{MLE}}(\kappa_{\tau,\mathrm{MLE}}))$. Due to the
monotonicity of the logarithm function, the mode in the transformed space is simply
$(\log \kappa_{\tau,\mathrm{MLE}}, \log \lambda_{\tau,\mathrm{MLE}})$. We can be confident that the conditional posterior is unimodal: the
Fisher information for a Gamma distribution is positive definite (so the expected log-likelihood Hessian
is negative definite), and the log transformation to the unconstrained space is monotonic.
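
The profile-likelihood recipe and the log-scale Hessian can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's code: the data `x` stand in for the values $\tau_i^{-2}$ and are simulated, the golden-section search over $\log \kappa_\tau$ is one possible numerical maximizer, and `digamma` and `trigamma` are small series approximations written for the example because the Python standard library does not provide them.

```python
import math
import random


def digamma(x):
    """Digamma via recurrence plus an asymptotic series (x > 0)."""
    total = 0.0
    while x < 6.0:
        total -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return total + math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))


def trigamma(x):
    """Trigamma via recurrence plus an asymptotic series (x > 0)."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    return total + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))


def loglik(kappa, lam, x):
    """l(kappa, lambda) for Gamma data with shape kappa and scale lambda."""
    n = len(x)
    return ((kappa - 1.0) * sum(math.log(v) for v in x) - n * kappa * math.log(lam)
            - n * math.lgamma(kappa) - sum(x) / lam)


def lambda_hat(kappa, x):
    return sum(x) / (kappa * len(x))          # closed-form maximizer in lambda


def find_mode(x, lo=-10.0, hi=10.0, tol=1e-10):
    """Golden-section search of the profile likelihood over log(kappa)."""
    prof = lambda u: loglik(math.exp(u), lambda_hat(math.exp(u), x), x)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if prof(c) > prof(d):
            b = d
        else:
            a = c
    kappa = math.exp(0.5 * (a + b))
    return kappa, lambda_hat(kappa, x)


def hessian_log_scale(kappa, lam, x):
    """Hessian of l with respect to (log kappa, log lambda) via the chain rule."""
    n, s = len(x), sum(x)
    d_k = sum(math.log(v) for v in x) - n * math.log(lam) - n * digamma(kappa)
    d_l = -n * kappa / lam + s / lam ** 2
    d_kk = -n * trigamma(kappa)
    d_kl = -n / lam
    d_ll = n * kappa / lam ** 2 - 2.0 * s / lam ** 3
    return [[kappa * d_k + kappa ** 2 * d_kk, kappa * lam * d_kl],
            [kappa * lam * d_kl, lam * d_l + lam ** 2 * d_ll]]


rng = random.Random(0)
x = [rng.gammavariate(2.0, 0.5) for _ in range(200)]   # simulated stand-in for tau_i^{-2}
kappa_hat, lam_hat = find_mode(x)
H_mode = hessian_log_scale(kappa_hat, lam_hat, x)
```

At the joint mode the first-derivative terms vanish, so the log-scale Hessian reduces to the original-scale Hessian rescaled by $\kappa_\tau$ and $\lambda_\tau$.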
Placing documents within a hierarchical structure is a common task and can be viewed as a multi-
label classification with hierarchical structure in the label space. Examples of such data include 
web pages and their placement in directories, product descriptions and associated categories from 
product hierarchies, and free-text clinical records and their assigned diagnosis codes. We present 
a model for hierarchically and multiply labeled bag-of-words data called hierarchically supervised 
latent Dirichlet allocation (HSLDA). Out-of-sample label prediction is the primary goal of this 
work, but improved lower-dimensional representations of the bag-of-words data are also of interest. 
We demonstrate HSLDA on large-scale data from clinical document labeling and retail product 
categorization tasks. We show that leveraging the structure from hierarchical labels improves out- 
of-sample label prediction substantially when compared to models that do not. 
15.1 Introduction

Documents frequently come with additional information like labels or popularity ratings. Contem- 
porary examples include the product ratings that accompany product descriptions, the number of 
“likes” that webpages have attracted, grades associated with assigned essays, and so forth. 

This chapter covers one way to jointly model documents and the labels applied to them. In 
particular we focus on our own work on modeling documents with more complicated labels that 
themselves possess some kind of structural organization. Consider typical product catalogs. They 
usually contain text descriptions of products that have been organized into hierarchical product di- 
rectories. The situation of a product into such a hierarchy (the path or paths in the product hierarchy 
that lead to it) can be thought of as a structured label. Jointly modeling the document and such a label 
is useful for automatically labeling new documents (corresponding in this example to automatically 
situating a new product in the product directory) and more. 

Collections of hierarchically labeled documents abound, text and otherwise. We will consider 
hierarchically labeled patient clinical records in later sections. Applications like situating web-pages 
in hierarchical link directories are left for others to explore. 

Because text documents are notoriously difficult to directly model we take an approach common 
to other chapters in this book. We use a mixed membership model of the text document to represent 
the document as a bag-of-words drawn from a document-specific mixture of topic distributions. 
The modeling choices we make in relating this representation to structured labels follow, as does
their relationship to prior art.

15.1.1 Background 

Mixed membership models, including the model upon which we build, latent Dirichlet allocation 
(LDA) (Blei et al., 2003), have been reviewed in other chapters. The key property we exploit for
purposes of classification is that LDA provides a way to extract a latent, low-dimensional represen- 
tation of text and other documents consisting of the frequency of word assignments to the topics 
that are assumed to have generated them. A topic is a distribution over words. Each document is a 
bag-of-words drawn from a document-specific mixture of topics. 

Building a joint model of documents and labels using this representation is not new. It was first 
introduced by Blei and McAuliffe in a paper on “supervised” latent Dirichlet allocation (SLDA) 
(Blei and McAuliffe, 2008). SLDA built on LDA by incorporating “supervision” in the form of an 
observed exponential family response variable per document. 

Latent Dirichlet Allocation 

Both to explain SLDA and to set the stage for our work, it helps to introduce our notation for
LDA. Assume that there are $K$ topics. Let $\phi_k \sim \mathrm{Dir}_V(\gamma \mathbf{1}_V)$ be “topic” $k$, i.e., a distribution over
a finite set of words. Here, $V$ simply labels the variables to indicate that they have to do with the
vocabulary. The vector $\mathbf{1}_V$ consists of all ones and has length equal to the size of the vocabulary.
The distribution Dir is the Dirichlet distribution. The constant $\gamma$ controls the smoothness of the
inferred topics. Larger values lead to smoother topic estimates. Let $\beta \mid \alpha' \sim \mathrm{Dir}_K(\alpha' \mathbf{1}_K)$ be a
“global” distribution over topics where $K$ indicates that the distribution and variable sizes are equal
to the number of topics $K$ and $\alpha'$ controls the relative proportion of topics globally, large $\alpha'$ leading
to all topics being roughly responsible for the same number of words. Intuitively, $\beta$ is something
like the average topic proportion independent of any particular document. Per document $d$, topic
distributions $\theta_d \mid \beta, \alpha \sim \mathrm{Dir}_K(\alpha \beta)$ are modeled as being deviations from the global distribution
over topics where larger values of $\alpha$ result in all documents’ topic distributions being more similar.
We will use $z_{n,d} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$ to indicate which topic generated the $n$th word of
document $d$. Drawing the $n$th word in document $d$ from the indicated topic, $w_{n,d} \mid z_{n,d}, \phi_{1:K} \sim
\mathrm{Multinomial}(\phi_{z_{n,d}})$, completes our notation of standard LDA.
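
The generative notation above can be read as a short simulation. The following is a minimal sketch, assuming toy sizes and hyperparameter values chosen only for illustration; the helper names `dirichlet`, `categorical`, and `generate_corpus` are hypothetical.

```python
import random


def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]


def categorical(probs, rng):
    """Draw an index from a discrete distribution."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1


def generate_corpus(K, V, doc_lengths, gamma, alpha_prime, alpha, rng):
    phi = [dirichlet([gamma] * V, rng) for _ in range(K)]    # topics phi_k
    beta = dirichlet([alpha_prime] * K, rng)                 # global topic proportions
    docs = []
    for n_d in doc_lengths:
        theta = dirichlet([alpha * b for b in beta], rng)    # per-document proportions
        z = [categorical(theta, rng) for _ in range(n_d)]    # topic assignments
        w = [categorical(phi[k], rng) for k in z]            # words
        docs.append({"theta": theta, "z": z, "w": w})
    return phi, beta, docs


rng = random.Random(0)
phi, beta, docs = generate_corpus(K=3, V=10, doc_lengths=[20, 35, 15],
                                  gamma=0.5, alpha_prime=2.0, alpha=5.0, rng=rng)
```

Raising `alpha` in this sketch pulls every `theta` toward `beta`, which is the sense in which larger values make the documents' topic distributions more similar.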

Supervised Latent Dirichlet Allocation 

Supervised LDA adds another per document observation, a label $y_d$. It is modeled as being generated
by a generalized linear model (GLM), $y_d \mid \bar{z}_d, \eta, \delta \sim \mathrm{GLM}(\bar{z}_d; \eta, \delta)$. Brushing aside
exponential family and GLM link function generalities, what SLDA does is regress labels against
the empirical distributions of assigned topic indicator variables $\bar{z}_d = \{\bar{z}_{1,d}, \ldots, \bar{z}_{K,d}\}$, where
$\bar{z}_{k,d} = N_d^{-1} \sum_{n=1}^{N_d} I(z_{n,d} = k)$ is the fraction of words assigned to topic $k$ and $I(\cdot)$ is the indicator function
that returns one if its argument is true. If the document labels are real-valued then one example
choice for the regression relationship would be $y_d \mid \bar{z}_d, \eta \sim \mathcal{N}(\bar{z}_d^T \eta, \delta)$. Using a generalized linear
model in the exponential family to parameterize this regression relationship allows for a wide vari- 
ety of distributions over different kinds of label spaces to be represented in the same mathematical 
formalism. A variational expectation maximization algorithm was proposed in Blei and McAuliffe 
(2008) to learn model parameters. Experimental results in Blei and McAuliffe (2008) showed both 
excellent out-of-sample label prediction and improved topics. Topic improvement was measured by 
using the empirical topic proportions from SLDA as features for external, discriminative approaches 
to label prediction. Regression models built on SLDA topics outperformed the same built on LDA 
derived topics. 

In one sense, SLDA is more general than the model we present in this chapter, namely, the 
labels need not be categorically valued. The main subject of this chapter, a generalization of SLDA
called hierarchically supervised LDA (HSLDA) (Perotte et al., 2011), does not deal with real-valued
labels; however, it is more general than SLDA in the case of categorical labels. The exponential
family/GLM regression framework can theoretically account for multivariate labels and potentially 
even structured categorical labels. HSLDA is, however, a specific, practical way to model with 
structured categorical labels. Because we focus on hierarchically structured categorical labels, we 
refer to our model as a mixed membership hierarchical classification model. 


15.2 Hierarchical Supervised Latent Dirichlet Allocation 

This model (HSLDA) is designed to fit hierarchically, multiply-labeled, bag-of-word data. We call
groups of bag-of-words data documents (unordered words in text documents, bag of visual feature
representations of images, etc.). Let $w_{n,d} \in \Sigma$ be the $n$th observation in the $d$th document. Let
$w_d = \{w_{1,d}, \ldots, w_{N_d,d}\}$ be the set of $N_d$ observations in document $d$. Let there be $D$ such documents and
let the size of the vocabulary be $V = |\Sigma|$.

Let the set of labels be $\mathcal{L} = \{l_1, l_2, \ldots, l_{|\mathcal{L}|}\}$. A label in HSLDA corresponds to a node in
the graphical model in Figure 15.1. A label can either be observed or unobserved. Documents can
be multiply labeled, meaning that subsets of label nodes in the graphical model in Figure 15.1
can have observed values. Each label $l \in \mathcal{L}$, except the root, has a parent $\mathrm{pa}(l) \in \mathcal{L}$ also in the
set of labels. We will, for exposition purposes, assume that this label set has hard “is-a” parent- 
child constraints (explained later), although this assumption can be relaxed at the cost of more 
computationally complex inference. Such a label hierarchy forms a multiply rooted tree. Readers 
may wish to consult Figure 15.2 or Figure 15.3, each of which is a label-tree graphical model. 

In a label forest a node may be observed (the label was “applied” and, for instance, was observed
to have value 1), unobserved and unknown, or unobserved and constrained to be either -1 or 1 by
the structure of the label space. Without loss of generality we will consider a tree with a single root
$r \in \mathcal{L}$. Each document has a variable $y_{l,d} \in \{-1, 1\}$ for every label which indicates whether the
label applies to document $d$ or not. In most cases $y_{l,d}$ will be unobserved; in some cases we will
be able to fix its value because of constraints on the label hierarchy, and in the relatively minor
remainder its value will be observed. In the applications we demonstrate, only positive labels are
observed. This may not be true of all applications; however, positive-only label imbalance is a
common problem. How we solve this problem will be discussed later.

The constraints imposed by an is-a label hierarchy are that if the $l$th label is applied to document
$d$, i.e., $y_{l,d} = 1$, then all labels in the label hierarchy up to the root are also applied to document $d$,
i.e., $y_{\mathrm{pa}(l),d} = 1, y_{\mathrm{pa}(\mathrm{pa}(l)),d} = 1, \ldots, y_{r,d} = 1$. Conversely, if a label $l'$ is marked as not applying
to a document (i.e., $y_{l',d} = -1$) then no descendant label of that label can take value 1. We assume
that at least one label is applied to every document. This is illustrated in Figure 15.1 where the 
root label is always applied but only some of the descendant labelings are observed as having been 
applied (diagonal hashing indicates that potentially some of the plated variables are observed). 
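
The is-a constraints amount to two propagation rules, which the following sketch spells out. It is illustrative only: the function name `fill_in_labels`, the dictionary representation of the tree, and the small retail-style hierarchy are assumptions, and the input is assumed to be consistent (no positive label below a negative one).

```python
def fill_in_labels(parent, observed):
    """Fill in label values implied by is-a constraints for one document.

    parent:   dict mapping each label to its parent (the root maps to None).
    observed: dict mapping some labels to +1 or -1.
    Returns a dict with the observed values plus every value the hierarchy forces.
    Labels that remain unknown are simply absent from the result.
    """
    children = {}
    for label, par in parent.items():
        children.setdefault(par, []).append(label)

    y = dict(observed)
    root = next(label for label, par in parent.items() if par is None)
    y[root] = 1                                   # the root label is always applied

    # A positive label forces all of its ancestors to be positive.
    for label, value in observed.items():
        if value == 1:
            node = parent[label]
            while node is not None:
                y[node] = 1
                node = parent[node]

    # A negative label forces all of its descendants to be negative.
    stack = [label for label, value in y.items() if value == -1]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            y[child] = -1
            stack.append(child)
    return y


# Hypothetical hierarchy used only for this example.
parent = {"root": None, "sports": "root", "balls": "sports",
          "basketballs": "balls", "toys": "root", "dolls": "toys"}
filled = fill_in_labels(parent, {"balls": 1, "toys": -1})
```

Here `basketballs` stays unobserved: its parent is positive, so the hierarchy alone does not determine its value.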



FIGURE 15.1 

Hierarchically supervised latent Dirichlet allocation (HSLDA) graphical model. 

In HSLDA, the bag-of-word document data is modeled using LDA with full, hierarchical topic 
estimation (i.e., global topic proportions are also estimated). Label responses are modeled using a 
conditional hierarchy of probit regressors and will be discussed next. The full HSLDA graphical 
model is given in Figure 15.1. 

15.2.1 Generative Model 

In the following box the HSLDA generative model is given for the “is-a hierarchy” set of label
constraints. In the box and what follows in this chapter, $K$ is the number of LDA “topics” (distributions
over the elements of $\Sigma$), $\phi_k$ is a distribution over “words,” $\theta_d$ is a document-specific distribution
over topics, $\beta$ is a global distribution over topics, $\mathrm{Dir}_K(\cdot)$ is a $K$-dimensional Dirichlet distribution,
$\mathcal{N}_K(\cdot)$ is the $K$-dimensional Normal distribution, $I_K$ is the $K$-dimensional identity matrix, $\mathbf{1}_d$ is
the $d$-dimensional vector of all ones, and $I(\cdot)$ is an indicator function that takes the value 1 if its
argument is true and 0 otherwise.

HSLDA Generative Model

1. For each topic $k = 1, \ldots, K$

   • Draw a distribution over words $\phi_k \sim \mathrm{Dir}_V(\gamma \mathbf{1}_V)$.

2. For each label $l \in \mathcal{L}$

   • Draw a label weight vector $\eta_l \mid \mu, \sigma \sim \mathcal{N}_K(\mu \mathbf{1}_K, \sigma I_K)$.

3. Draw the global topic proportions $\beta \mid \alpha' \sim \mathrm{Dir}_K(\alpha' \mathbf{1}_K)$.

4. For each document $d = 1, \ldots, D$

   • Draw topic proportions $\theta_d \mid \beta, \alpha \sim \mathrm{Dir}_K(\alpha \beta)$.

   • For $n = 1, \ldots, N_d$

     - Draw topic assignment $z_{n,d} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$.

     - Draw word $w_{n,d} \mid z_{n,d}, \phi_{1:K} \sim \mathrm{Multinomial}(\phi_{z_{n,d}})$.

   • Set $y_{r,d} = 1$.

   • For each label $l$ in a breadth first traversal of $\mathcal{L}$ starting at the children of root $r$

     - Draw

       \[
       a_{l,d} \mid \bar{z}_d, \eta_l, y_{\mathrm{pa}(l),d} \sim
       \begin{cases}
       \mathcal{N}(\bar{z}_d^T \eta_l, 1), & y_{\mathrm{pa}(l),d} = 1 \\
       \mathcal{N}(\bar{z}_d^T \eta_l, 1)\, I(a_{l,d} < 0), & y_{\mathrm{pa}(l),d} = -1.
       \end{cases}
       \qquad (15.1)
       \]

     - Apply label $l$ to document $d$ according to $a_{l,d}$

       \[
       y_{l,d} \mid a_{l,d} =
       \begin{cases}
       1 & \text{if } a_{l,d} > 0 \\
       -1 & \text{otherwise.}
       \end{cases}
       \qquad (15.2)
       \]



Here $\bar{z}_d^T = [\bar{z}_1, \ldots, \bar{z}_k, \ldots, \bar{z}_K]$ is the empirical topic distribution for document $d$, in which
each entry is the fraction of the words in that document that come from topic $k$,
$\bar{z}_k = N_d^{-1} \sum_{n=1}^{N_d} I(z_{n,d} = k)$. As in Blei and McAuliffe (2008), the response variables are directly
dependent on $\bar{z}_d$ because this directly couples the topic assignments used to explain the words and the
topic assignments used to explain the responses.

The second half of Step 4 is what is referred to as supervision in the supervised LDA literature.
This is where the hierarchical classification of the bag-of-words data takes place and the is-a label
constraints are enforced. For every label $l \in \mathcal{L}$, both the empirical topic distribution for document
$d$ and whether or not its parent label was applied (i.e., $I(y_{\mathrm{pa}(l),d} = 1)$) are used to determine
whether or not label $l$ is to be applied to document $d$ as well. Equations (15.1) and (15.2) comprise
a probit regression model in an auxiliary variable formulation (see Appendix). Note that in the
case that the parent label is applied, i.e., $y_{\mathrm{pa}(l),d} = 1$, the child label $y_{l,d}$ is applied with probability
$P(a_{l,d} > 0) = \Phi(\bar{z}_d^T \eta_l)$, where $\Phi$ is the standard normal CDF. This is a conditional probit regression
model for classification where $\eta_l$ are the class-conditional regression parameters. The auxiliary
variables $a_{l,d}$ make inference tractable but are not fundamental to the model; only the labels and
regression parameters are actually of interest.

Note that $y_{l,d}$ can only be applied to document $d$ (set to 1) if its parent label $\mathrm{pa}(l)$ is also applied
(these expressions are specific to is-a constraints but could be modified to accommodate different
constraints between labels). Note that multiple labels can be applied to the same document. The
regression coefficients $\eta_l$ which generate the labels are independent a priori; however, the hierarchical
coupling in this model and conditional label dependency structure induces a posteriori dependence.
The net effect of this conditional hierarchy of probit regressors is that child label predictors deeper in
the label hierarchy are able to focus on finding features that distinguish label paths in the tree,
conditioned on the fact that all the children of any particular node are by design members of some more
general parent set. One can restrict this hierarchy to a depth of one, recovering SLDA
with probit link and univariate categorical labels. Also, one can nearly as easily make the conditional
classification at each node multi-class rather than single-class if more than one label at each
node is required. In many cases, however, a binary indicator along with a deeper or more complex
tree is sufficient.
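
The label-generation step, Equations (15.1) and (15.2), can be simulated directly. The sketch below is illustrative and rests on assumptions: the tree, the weight vectors in `eta`, and the topic proportions `z_bar` are invented for the example, and the truncated normal draw uses naive rejection sampling, which is only reasonable when the mean $\bar{z}_d^T \eta_l$ is not large and positive.

```python
import random
from collections import deque


def sample_labels(parent, root, eta, z_bar, rng):
    """Draw auxiliary variables and labels for one document, breadth first from the root."""
    children = {}
    for label, par in parent.items():
        children.setdefault(par, []).append(label)

    y = {root: 1}                               # the root label is always applied
    a = {}
    queue = deque(children.get(root, []))
    while queue:
        label = queue.popleft()
        mean = sum(z * e for z, e in zip(z_bar, eta[label]))
        draw = rng.gauss(mean, 1.0)
        if y[parent[label]] == -1:
            # Parent not applied: restrict the normal draw to a < 0 (naive rejection).
            while draw >= 0.0:
                draw = rng.gauss(mean, 1.0)
        a[label] = draw
        y[label] = 1 if draw > 0.0 else -1
        queue.extend(children.get(label, []))
    return y, a


# Hypothetical tree, weights, and empirical topic proportions for the example.
parent = {"root": None, "sports": "root", "balls": "sports",
          "basketballs": "balls", "toys": "root", "dolls": "toys"}
eta = {"sports": [1.0, -0.5, 0.2], "balls": [0.8, 0.1, -1.0],
       "basketballs": [1.5, -1.0, 0.0], "toys": [-1.2, 0.9, 0.3],
       "dolls": [-0.5, 1.4, -0.2]}
z_bar = [0.6, 0.3, 0.1]

rng = random.Random(0)
samples = [sample_labels(parent, "root", eta, z_bar, rng) for _ in range(200)]
```

Because a negative parent forces a negative auxiliary variable, no sampled labeling can violate the is-a constraints.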

Note that the choice of variables $a_{l,d}$ and how they are distributed were driven at least in part
by posterior inference efficiency considerations (see Appendix). In particular, choosing probit-style
auxiliary variable distributions for the $a_{l,d}$’s yields conditional posterior distributions for both the
auxiliary variables (15.5) and the regression coefficients (15.4), which are analytic. This simplifies
posterior inference substantially. A review of probit regression can be found near the end of this 
chapter in the Appendix. 

15.2.2 Dealing with Label Imbalance 

In the common case where no negative labels are observed (like the example applications we con- 
sider in Section 15.4), the model must be explicitly biased towards generating negative labels in 
order to keep it from learning to only assign positive labels to all documents. This is a common 
problem in modeling with unbalanced labels. To see how this model can achieve this we draw the 
reader’s attention to the $\mu$ parameter and, to a lesser extent, the $\sigma$ parameter. Because the entries
of $\bar{z}_d$ are nonnegative and sum to one, setting $\mu$ to a negative value results in a bias towards negative
labelings, i.e., for large negative values of $\mu$, all labels become a priori more likely to be negative
($y_{l,d} = -1$). We explore the effect of $\mu$ on out-of-sample label prediction performance in Section 15.4.
In a very real way, $\mu$ is a knob that can be adjusted before inference to induce a broad array of
out-of-sample performance characteristics that vary along classical axes like specificity, recall, and accuracy. A similar
but less principled solution can be effected by changing the decision boundary from 0 in (15.1) and 
(15.2). This technique can be used to vary out-of-sample label bias after learning. 
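
A small calculation makes the effect of $\mu$ explicit. It is a sketch under two assumptions that go beyond the text: $\sigma$ is read as the variance scale in $\eta_l \sim \mathcal{N}_K(\mu \mathbf{1}_K, \sigma I_K)$, and the parent label is taken to be applied. Integrating out $\eta_l$ then gives $a_{l,d} \sim \mathcal{N}(\mu, 1 + \sigma \lVert \bar{z}_d \rVert^2)$, because the entries of $\bar{z}_d$ sum to one. The function name `prior_positive_prob` is hypothetical.

```python
import math


def prior_positive_prob(mu, sigma, z_bar):
    """Prior probability that a label is applied, given that its parent is applied.

    Assumes eta ~ N(mu * 1, sigma * I) with sigma a variance, and a ~ N(z_bar^T eta, 1).
    """
    var = 1.0 + sigma * sum(z * z for z in z_bar)     # variance of a after integrating out eta
    return 0.5 * (1.0 + math.erf(mu / math.sqrt(2.0 * var)))


z_bar = [0.5, 0.5]
probs = [prior_positive_prob(mu, 1.0, z_bar) for mu in (-3.0, -1.0, 0.0, 1.0)]
```

The probability is one half at $\mu = 0$ and falls monotonically as $\mu$ becomes more negative, which is the bias towards negative labelings described above.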

15.2.3 Intuition 

To help ground this abstract graphical model, recall the retail data example application. We as- 
serted that retailers often have both a browseable product hierarchy and free-text descriptions for 
all products they sell. The situation of each product in a product hierarchy (often multiply situated) 
constitutes a multiple, hierarchical labeling $y_d$ of the free-text product descriptions $w_d$ for all
products $d$. Note that a single product can be placed in the hierarchy in multiple places. This corresponds
to multiple paths in the label hierarchy having labels that are all applied. HSLDA assumes that the 
free-text descriptions of all of the products in a particular node in the product hierarchy must be 
related. It also assumes that products deeper in the product hierarchy are described using language 
that is similar to that used to describe products in their parent classes. For instance, basketballs are 
probably described using language that is similar to that used to describe other basketballs, other 
balls, and more general sporting goods. In both the lay and technical senses, similar products should 
have product descriptions that share topics. If topic proportions are indicative of the text describing 
products that are grouped together, the key HSLDA assumption is that it should then be possible 
to use those proportions to decide (via probit classification) whether or not a particular product 
should be situated at a particular node in the product hierarchy. Conversely, that certain groups of
products are known to be clustered together should inform the kinds of topics that are inferred from
the product descriptions.


15.3 Inference 

Our inference goal is to obtain a representation of the posterior distribution of the latent variables 
in the model. This posterior distribution can then be used for predictive inference of labels for 
held-out documents, among other things. Unfortunately, the posterior distribution we seek does not 
have a simple analytic form from which exact samples can be drawn. This is usually the case for 
posterior distributions of non-trivial probabilistic models and suggests approximating the posterior 
distribution by sampling. 

In this section we derive the conditional distributions required to sample from the HSLDA pos- 
terior distribution using Markov chain Monte Carlo. The HSLDA sampler, like the collapsed Gibbs 
samplers for LDA (Griffiths and Steyvers, 2004), is itself a collapsed Gibbs sampler in which all of 
the latent variables that can be analytically marginalized are. Among others, the topic distributions 
$\phi_{1:K}$ and document-specific topic assignment distributions $\theta_{1:D}$ are analytically marginalized prior
to deriving the following conditional distributions for sampling. 

It will usually be the case that values $y_{l,d}$ will not be known for all labels $l \in \mathcal{L}$ in the space of
possible labels. Values for $y_{l,d}$ that are enforced by label constraints and observed labels are set to
their constrained values prior to inference and treated as observed. We will define $\mathcal{L}_d$ to be the subset
of labels which have been observed (or observed via filling in from constraints) for document $d$.
Marginalizing the probit regression auxiliary variables $a_{l',d}$ and $y_{l',d}$ for $l' \in \mathcal{L} \setminus \mathcal{L}_d$ is simple in the
is-a hierarchy case because they can simply be ignored. The remaining latent variables (those that are
not collapsed out) are the topic indicators $\mathbf{z} = \{z_{1:N_d,d}\}_{d=1,\ldots,D}$, the probit regression parameters
$\eta = \{\eta_l\}_{l \in \mathcal{L}}$, the auxiliary variables $\mathbf{a} = \{a_{l',d}\}_{l' \in \mathcal{L}_d,\, d=1,\ldots,D}$, the global topic proportions $\beta$, and
the concentration parameters $\alpha$, $\alpha'$, and $\gamma$.
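
As an illustration of "filling in from constraints," the following Python sketch propagates observed labels through an is-a hierarchy: a positive label makes every ancestor positive, and a label below a negative node is itself negative. The `parent` map, the toy label names, and the function name `fill_in_labels` are hypothetical and are not taken from the HSLDA implementation.

```python
def fill_in_labels(parent, observed):
    """Propagate observed labels (+1 / -1) through an is-a hierarchy.

    parent:   dict mapping each label to its parent label (None for a root).
    observed: dict mapping some labels to +1 or -1.
    Returns the observed labels plus those implied by the constraints.
    """
    filled = dict(observed)
    # A positive label implies that every ancestor is positive.
    for label, value in observed.items():
        if value == 1:
            node = parent.get(label)
            while node is not None:
                filled[node] = 1
                node = parent.get(node)
    # A label with a negative ancestor must itself be negative.
    for label in parent:
        node = parent.get(label)
        while node is not None:
            if filled.get(node) == -1:
                filled[label] = -1
                break
            node = parent.get(node)
    return filled


# Toy hierarchy: sporting goods -> balls -> basketballs; sporting goods -> bats.
parent = {"sporting goods": None, "balls": "sporting goods",
          "basketballs": "balls", "bats": "sporting goods"}
print(fill_in_labels(parent, {"balls": 1, "bats": -1}))
```

Labels that are neither observed nor implied (here, "basketballs") stay out of the returned dictionary, which corresponds to leaving them unobserved during inference.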


15.3.1 Gibbs Sampler 

Let $\mathbf{a}$ be the set of all probit regression auxiliary variables, $\mathbf{w}$ the set of all words, $\eta$ the set of all
regression coefficients, and $\mathbf{z} \setminus z_{n,d}$ the set $\mathbf{z}$ with element $z_{n,d}$ removed.

First we consider the conditional distribution of $z_{n,d}$ (the assignment variable for each word
$n = 1, \ldots, N_d$ in documents $d = 1, \ldots, D$). Following the factorization of the model (refer again
to Figure 15.1), we can write

\[
p(z_{n,d} \mid \mathbf{z} \setminus z_{n,d}, \mathbf{a}, \mathbf{w}, \eta, \alpha, \beta, \gamma)
\propto \prod_{l \in \mathcal{L}_d} p(a_{l,d} \mid \mathbf{z}, \eta_l)\; p(z_{n,d} \mid \mathbf{z} \setminus z_{n,d}, \mathbf{a}, \mathbf{w}, \alpha, \beta, \gamma).
\]

The product is only over the subset of labels $\mathcal{L}_d$ which have been observed for document $d$. By
isolating terms that depend on $z_{n,d}$ and absorbing all other terms into a normalizing constant as in
Griffiths and Steyvers (2004) we find

\[
p(z_{n,d} = k \mid \mathbf{z} \setminus z_{n,d}, \mathbf{a}, \mathbf{w}, \eta, \alpha, \beta, \gamma)
\propto \left(c^{k,-(n,d)}_{(\cdot),d} + \alpha\beta_k\right)
\frac{c^{k,-(n,d)}_{w_{n,d},(\cdot)} + \gamma}{c^{k,-(n,d)}_{(\cdot),(\cdot)} + V\gamma}
\prod_{l \in \mathcal{L}_d} \exp\left\{-\frac{\left(\bar{z}_d^T \eta_l - a_{l,d}\right)^2}{2}\right\},
\tag{15.3}
\]

where $c^{k,-(n,d)}_{v,d}$ is the number of words of type $v$ in document $d$ assigned to topic $k$ omitting the
$n$th word of document $d$. The subscript $(\cdot)$ indicates summation over the range of the replaced variable,
i.e., $c^{k,-(n,d)}_{w_{n,d},(\cdot)} = \sum_d c^{k,-(n,d)}_{w_{n,d},d}$. Here, $\mathcal{L}_d$ is the set of labels which are observed for document $d$. We
sample from (15.3) by first enumerating $z_{n,d}$ and then normalizing.
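
The enumerate-then-normalize step for (15.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the count arrays are taken to already exclude the current word, the empirical topic proportions are recomputed for each candidate topic, and the function name, argument names, and toy numbers are all hypothetical.

```python
import math
import random

def topic_conditional(c_doc, c_word, c_total, alpha, beta, gamma, V, eta, a):
    """Normalized conditional p(z_{n,d} = k | everything else), following (15.3).

    c_doc[k]:   words in document d assigned to topic k (current word excluded)
    c_word[k]:  corpus-wide count of the current word type in topic k (excluded)
    c_total[k]: corpus-wide count of all words in topic k (excluded)
    eta[i], a[i]: coefficients and auxiliary variable of the i-th observed label
    """
    K = len(c_doc)
    n_d = sum(c_doc) + 1  # document length, counting the current word
    log_w = []
    for k in range(K):
        # Empirical topic proportions if the current word were assigned to k.
        z_bar = [(c_doc[j] + (1 if j == k else 0)) / n_d for j in range(K)]
        lw = math.log(c_doc[k] + alpha * beta[k])
        lw += math.log(c_word[k] + gamma) - math.log(c_total[k] + V * gamma)
        for eta_l, a_l in zip(eta, a):
            mean = sum(z * e for z, e in zip(z_bar, eta_l))
            lw -= 0.5 * (mean - a_l) ** 2
        log_w.append(lw)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]


probs = topic_conditional(
    c_doc=[3, 1], c_word=[2, 0], c_total=[10, 8],
    alpha=1.0, beta=[0.5, 0.5], gamma=0.1, V=5,
    eta=[[1.0, -1.0]], a=[0.8],
)
k_new = random.choices(range(len(probs)), weights=probs)[0]  # draw z_{n,d}
```

With no observed labels (empty `eta` and `a`) the label product is empty and the weights reduce to the first two factors of (15.3).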

The conditional posterior distribution of the regression coefficients is given by 

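\[
p(\eta_l \mid \mathbf{z}, \mathbf{a}, \sigma) = \mathcal{N}(\hat{\mu}_l, \hat{\Sigma}) \tag{15.4}
\]

for $l \in \mathcal{L}$. Given that $\eta_l$ and $a_{l,d}$ are distributed normally, the posterior distribution of $\eta_l$ is
normally distributed with mean $\hat{\mu}_l$ and covariance $\hat{\Sigma}$, where

\[
\hat{\mu}_l = \hat{\Sigma}\left(\mathbf{1}\frac{\mu}{\sigma} + \bar{Z}^T a_l\right), \qquad \hat{\Sigma}^{-1} = I\sigma^{-1} + \bar{Z}^T \bar{Z}.
\]

Here $\bar{Z}$ is a $D \times K$ matrix such that row $d$ of $\bar{Z}$ is $\bar{z}_d$, and $a_l = [a_{l,1}, a_{l,2}, \ldots, a_{l,D}]^T$. The simplicity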

of this conditional distribution follows from the choice of probit regression (Albert and Chib, 1993); 
the specific form of the update is a standard result from Bayesian normal linear regression (Gelman 
et al., 2004). It also is a standard probit regression result that the conditional posterior distribution 
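of $a_{l,d}$ is a truncated normal distribution (Albert and Chib, 1993) (see also the Appendix):

\[
p(a_{l,d} \mid \mathbf{z}, \mathbf{Y}, \eta) \propto
\begin{cases}
\exp\left\{-\tfrac{1}{2}\left(a_{l,d} - \eta_l^T \bar{z}_d\right)^2\right\} I(a_{l,d}\, y_{l,d} > 0)\, I(a_{l,d} < 0), & y_{\mathrm{pa}(l),d} = -1, \\
\exp\left\{-\tfrac{1}{2}\left(a_{l,d} - \eta_l^T \bar{z}_d\right)^2\right\} I(a_{l,d}\, y_{l,d} > 0), & y_{\mathrm{pa}(l),d} = 1.
\end{cases}
\]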
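
The two conditional draws above, for the regression coefficients and for the auxiliary variables, can be sketched in Python as follows. This is an illustration only: it assumes NumPy is available, treats the prior parameter sigma as a variance, and samples the truncated normal by inverting the cdf, which is a choice made for this example rather than a detail of the HSLDA implementation. All function names and toy data are hypothetical.

```python
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()

def sample_eta(Z_bar, a_l, mu, sigma, rng):
    """Draw eta_l ~ N(mu_hat, Sigma_hat) as in (15.4).

    Z_bar: (D, K) matrix of empirical topic proportions, one row per document
    a_l:   (D,) auxiliary variables for label l
    mu, sigma: prior mean and prior variance of each regression coefficient
    """
    K = Z_bar.shape[1]
    precision = np.eye(K) / sigma + Z_bar.T @ Z_bar
    Sigma_hat = np.linalg.inv(precision)
    Sigma_hat = (Sigma_hat + Sigma_hat.T) / 2.0  # remove round-off asymmetry
    mu_hat = Sigma_hat @ (np.full(K, mu / sigma) + Z_bar.T @ a_l)
    return rng.multivariate_normal(mu_hat, Sigma_hat), mu_hat, Sigma_hat

def sample_auxiliary(mean, y, y_parent, rng):
    """Draw a_{l,d} ~ N(mean, 1), truncated according to the label constraints."""
    positive = (y == 1 and y_parent == 1)   # a negative parent forces a < 0
    p_neg = STD_NORMAL.cdf(-mean)           # P(a < 0) before truncation
    lo, hi = (p_neg, 1.0) if positive else (0.0, p_neg)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)     # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(u)


rng = np.random.default_rng(0)
Z_bar = rng.dirichlet(np.ones(3), size=20)  # toy topic proportions
a_l = rng.normal(size=20)                   # toy auxiliary variables
eta_l, mu_hat, Sigma_hat = sample_eta(Z_bar, a_l, mu=-1.0, sigma=1.0, rng=rng)
a_new = sample_auxiliary(float(Z_bar[0] @ eta_l), y=1, y_parent=1, rng=rng)
```

Inverse-cdf sampling loses accuracy when the truncation region lies far in a tail of the normal; a rejection sampler is a common alternative there.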

HSLDA employs a hierarchical Dirichlet prior over topic assignments (i.e., $\beta$ is estimated from
data rather than fixed a priori). This has been shown to improve the quality and stability of inferred
topics (Wallach et al., 2009). Sampling $\beta$, the vector of global topic proportions, can be done using
the “direct assignment” method of Teh et al. (2006):

\[
\beta \mid \mathbf{z}, \alpha', \alpha \sim \mathrm{Dir}\left(m_{(\cdot),1} + \alpha',\; m_{(\cdot),2} + \alpha',\; \ldots,\; m_{(\cdot),K} + \alpha'\right). \tag{15.5}
\]

Here, $m_{d,k}$ are additional auxiliary variables that are introduced by the direct assignment method to
sample the posterior distribution of $\beta$. Their conditional posterior distribution is sampled according
to

\[
p(m_{d,k} = m \mid \mathbf{z}, \mathbf{m}_{-(d,k)}, \beta) = \frac{\Gamma(\alpha\beta_k)}{\Gamma\left(\alpha\beta_k + c^{k}_{(\cdot),d}\right)}\, s\left(c^{k}_{(\cdot),d}, m\right) (\alpha\beta_k)^m, \tag{15.6}
\]

where $s(n, m)$ denotes Stirling numbers of the first kind. The hyperparameters $\alpha$, $\alpha'$, and $\gamma$ are
sampled using Metropolis-Hastings.
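
A small Python sketch of the draw in (15.6) is given below. It assumes that the Stirling numbers are the unsigned ones (so that every term is nonnegative) and computes them exactly with integer arithmetic; the function names and toy values are hypothetical.

```python
import math
import random

def stirling_first_kind(n):
    """Return [s(n, 0), ..., s(n, n)], unsigned Stirling numbers of the first kind."""
    row = [1]  # s(0, 0) = 1
    for i in range(n):
        # Recurrence: s(i + 1, m) = s(i, m - 1) + i * s(i, m)
        row = [(row[m - 1] if m > 0 else 0) + (i * row[m] if m <= i else 0)
               for m in range(i + 2)]
    return row

def table_count_probs(count, alpha_beta_k):
    """p(m_{d,k} = m | ...) for m = 0, ..., count, following (15.6)."""
    s = stirling_first_kind(count)
    ratio = math.gamma(alpha_beta_k) / math.gamma(alpha_beta_k + count)
    return [ratio * s[m] * alpha_beta_k ** m for m in range(count + 1)]


probs = table_count_probs(count=5, alpha_beta_k=0.7)
m = random.choices(range(len(probs)), weights=probs)[0]  # draw m_{d,k}
```

The probabilities sum to one because the unsigned Stirling numbers are the coefficients of the rising factorial, whose value is the ratio of gamma functions in (15.6). For large counts the gamma function overflows, so a log-space version would be needed in practice.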

It remains now to show that HSLDA works. To do so we demonstrate results from modeling 
real-world datasets in the clinical and web retail domains. These results provide evidence that the 
two views (text and labels) mutually benefit multi-label classification. That is, modeling the joint is 
better than learning topic models and hierarchical classifiers independently. 


15.4 Example Applications 

15.4.1 Hospital Discharge Summaries and ICD-9 Codes 

Despite the growing emphasis on meaningful use of technology in medicine, many aspects of medi- 
cal record-keeping remain a manual process. In the U.S., diagnostic coding for billing and insurance 
purposes is often handled by professional medical coders who must explore a patient’s extensive
clinical record before assigning the proper codes. 

A specific example of this involves labeling of hospital discharge summaries. These summaries 
are authored by clinicians to summarize patient hospitalization courses. They typically contain a 
record of patient complaints, findings, and diagnoses, along with treatment and hospital course. The 
kind of text one might expect to find in such a discharge summary is illustrated by this made-up 
snippet: 

History of Present Illness: Mrs. Carmen Sandiego is a 62-year-old female with a past med- 
ical history significant for diabetes, hypertension, hyperlipidemia, afib, status post MI in 
5/2010 and cholecystectomy in 3/2009. The patient presented to the ED on 7/11/2011 with
a right sided partial facial hemiparesis along with mild left arm weakness. The patient was 
admitted to the Neurology service and underwent a workup for stroke given her history of 
MI and many cardiovascular risk factors ... 

For each hospitalization, trained medical coders review the information in the discharge summary 
and assign a series of diagnosis codes. Coding follows the ICD-9-CM controlled terminology, an
international diagnostic classification for epidemiological, health management, and clinical pur- 
poses. 1 These ICD-9 codes are organized in a rooted-tree structure with each edge representing an 
is-a relationship between parent and child such that the parent diagnosis subsumes the child diag- 
nosis. For example, the code for “Pneumonia due to adenovirus” is a child of the code for “Viral 
pneumonia,” where the former is a type of the latter. A representative sub-tree of the ICD-9 code tree 
is shown in Figure 15.2. It is worth noting that the coding can be noisy. Human coders sometimes 
disagree (Cha, 2007), tend to be more specific than sensitive in their assignments (Birman-Deych
et al., 2005), and sometimes make mistakes (Farzandipour et al., 2010).



FIGURE 15.2 

An illustration of a portion of the ICD-9 hierarchy.

An automated process would ideally produce more complete and accurate diagnosis lists. The
task of automatic ICD-9 coding has been investigated in the clinical domain. Methods used to solve
this problem (besides HSLDA) range from applying manually derived coding rules to applications
of online rule learning approaches (Crammer et al., 2007; Goldstein et al., 2007; Farkas and Szarvas,
2008). Many classification schemes have been applied to this problem: K-nearest neighbor, naive
Bayes, support vector machines, Bayesian ridge regression, as well as simple keyword mappings,
all with promising results (Larkey and Croft, 1995; Ribeiro-Neto et al., 2001; Pakhomov et al.,
2006; Lita et al., 2008).

1 See http://www.cdc.gov/nchs/icd/icd9cm.htm.

The specific dataset we report results for in this chapter was gathered from the New York- 
Presbyterian Hospital clinical data warehouse. It consists of 6000 discharge summaries and their 
associated ICD-9 codes (7,298 distinct codes overall), representing a portion of the discharges from 
the hospital in 2009. All included discharge summaries had associated ICD-9 codes. Summaries
have 8.39 associated ICD-9 codes on average (std dev=5.01) and contain an average of 536.57 
terms after preprocessing (std dev=300.29). We split our dataset into 5000 discharge summaries for 
training and 1000 for testing. 

The text of the discharge summaries was tokenized with NLTK. 2 A fixed vocabulary was formed 
by taking the top 10,000 tokens with the highest document frequency (exclusive of names, places, 
and other identifying numbers). The study was approved by the Institutional Review Board and 
follows HIPAA (Health Insurance Portability and Accountability Act) privacy guidelines. 
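
The vocabulary construction described above (keeping the tokens with the highest document frequency after excluding identifying tokens) might look like the following Python sketch. The input is assumed to be already tokenized; the function name, the exclusion set, and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, size, excluded=frozenset()):
    """Return the `size` tokens that occur in the most documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens) - excluded)  # count each token once per document
    # Sort by descending document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked[:size]


docs = [["chest", "pain", "pain"], ["chest", "fever"], ["chest", "pain", "smith"]]
vocab = build_vocabulary(docs, size=2, excluded=frozenset({"smith"}))
```

Document frequency counts a token once per document, so the repeated "pain" in the first toy document does not raise its rank.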

Here HSLDA is evaluated as a way to understand and model the relationship between a dis- 
charge summary and the ICD-9 codes that should be assigned to it. We show promising results for 
automatically assigning ICD-9 codes to hospital discharge records. 

15.4.2 Product Descriptions and Catalogs 

Many web retailers store and organize their catalog of products in a multiply-rooted hierarchy in
addition to providing textual product descriptions for most products. Products can be discovered by 
users through free-text search and product category exploration. Top-level product categories are 
displayed on the front page of the website and lower-level categories can be discovered by choosing 
one of the top-level categories. Products can exist in multiple locations in the hierarchy. 

Amazon.com is one such retailer. Its product categorization data is available as part of the Stan- 
ford Network Analysis Platform (SNAP) dataset (SNA, 2004). A representative sub-tree of the 
Amazon.com DVD product category tree is shown in Figure 15.3. Product descriptions were obtained
separately, directly from the Amazon.com website. One such description is

Winner of five Academy Awards, including Best Picture and Best Director, The Deer Hunter 
is simultaneously an audacious directorial conceit and one of the greatest films ever made 
about friendship and the personal impact of war. Like Apocalypse Now, it’s hardly a conventional
battle film — the soldier’s experience was handled with greater authenticity in
Platoon — but its depiction of war on an intimate scale packs a devastatingly dramatic punch 


We study the collection of DVDs in the product catalog specifically. The resulting dataset con- 
tains 15,130 product descriptions for training and 1000 for testing. The product descriptions consist 
of 91.89 terms on average (std dev=53.08). Overall, there are 2,691 unique categories. Products are 
assigned on average 9.01 categories (std dev=4.91). The vocabulary consists of the most frequent 
30,000 words omitting stopwords. 

HSLDA is used here to understand and model the relationship between the product text descrip- 
tion and the products’ positioning in the product hierarchy. We show how to automatically situate a 
product in a hierarchical product catalog. 


2 See http://www.nltk.org.

FIGURE 15.3

An illustration of a portion of the Amazon product hierarchy. 

15.4.3 Comparison Models 

We compare HSLDA to two closely related models. The comparison models are SLDA with inde- 
pendent regressors (hierarchical constraints on labels ignored, i.e., the regression is not conditional) 
and HSLDA fit by first performing LDA, then fitting probit regressors that respect the conditional 
label hierarchy (rather than jointly inferring the topics and the regression coefficients). These mod- 
els were chosen because they are the strongest available competitors and because they highlight 
several pedagogical aspects of HSLDA, including performance in the absence of hierarchical con- 
straints, the effect of the combined inference, and regression performance attributable solely to the 
hierarchical constraints. 

SLDA with independent regressors is the most salient comparison model for our work. The 
distinguishing factor between HSLDA and SLDA is the additional structure imposed on the label 
space, a distinction that in developing HSLDA we hypothesized would result in a difference in
predictive performance. 

The second comparison model, HSLDA fit by performing LDA first followed by performing 
inference over the hierarchically constrained label space, does not allow the responses to influence 
the topics inferred by LDA. Combined inference has been shown to improve performance in SLDA 
(Blei and McAuliffe, 2008). This comparison model does not examine the value of utilizing the 
structured nature of the label space; instead, it highlights the benefit of combined inference over 
both the documents and the label space. 

For all three models, particular attention was given to the settings of the prior parameters $(\mu, \sigma)$
for the regression coefficients ($\eta_l$). These parameters implement an important form of regularization
in HSLDA. In the setting where there are no negative labels, a Gaussian prior over the regression
parameters with a negative mean implements a prior belief that missing labels are likely to be negative.
Thus, we show model performance for all three models with a range of values for $\mu$, the mean
prior parameter for regression coefficients ($\mu \in \{-3, -2.8, -2.6, \ldots, 1\}$).

The number of topics for all models was set to $K = 50$, and the prior distributions $p(\alpha)$, $p(\alpha')$,
and $p(\gamma)$ were all chosen to be gamma with a shape parameter of 1 and a scale parameter of 1000.
Different values of $K$ corresponding to different numbers of topics were explored; however, the
results that we show in the following are not substantially changed in character. As is usual in
mixed membership models, there is an ideal number of topics that should be used for out-of-sample
prediction tasks; however, a full model-selection search varying topic cardinality was not performed
for these datasets.
15.4.4 Evaluation and Results

We are particularly interested in the predictive performance on held-out data. Prediction performance
was measured with standard metrics: sensitivity (true positive rate) and 1 − specificity (false
positive rate).

In each case the gold standard for testing was derived from the test data. To make the comparison 
as antagonistic to HSLDA as possible (relative to the other models), in evaluation only, ancestors 
of observed nodes in the label hierarchy were ignored, observed nodes were considered positive, 
and descendants of observed nodes were assumed to be negative. Note that this is different from 
our treatment of the observations during inference where we marginalize over possible settings of 
unobserved labels. For instance, as the SLDA model does not enforce hierarchical label constraints, 
when we consider only observed nodes we penalize HSLDA. This is because the is-a hierarchical 
constraints say that the ancestors of positively labeled nodes must also be positive, which the SLDA 
model cannot guarantee. Another antagonism of this gold standard is that it is likely to inflate the 
number of false positives because the labels applied to any particular document are usually not as 
complete as they could be. ICD-9 codes, for instance, are known to lack sensitivity and their use as 
a gold standard could lead to correctly positive predictions being labeled as false positives (Birman- 
Deych et al., 2005). However, given that the label space is often large (as in our examples), it is a 
reasonable assumption that erroneous false positives should not skew results significantly. 
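
The construction of this gold standard can be sketched in Python as follows. The sketch assumes a tree given as a child-to-parent map; how labels that are neither ancestors nor descendants of an observed node are scored is not stated above, so it is exposed as the `unrelated` argument, whose default of 0 (negative) is an assumption. All names are hypothetical.

```python
def gold_standard(parent, observed, unrelated=0):
    """Map each label to 1 (positive), 0 (negative), or None (ignored in evaluation)."""
    def ancestors(label):
        node = parent.get(label)
        while node is not None:
            yield node
            node = parent.get(node)

    ignored = {anc for label in observed for anc in ancestors(label)}
    gold = {}
    for label in parent:
        if label in observed:
            gold[label] = 1            # observed nodes are positive
        elif label in ignored:
            gold[label] = None         # ancestors of observed nodes are ignored
        elif any(anc in observed for anc in ancestors(label)):
            gold[label] = 0            # descendants of observed nodes are negative
        else:
            gold[label] = unrelated    # assumption: not specified in the text
    return gold


parent = {"A": None, "B": "A", "C": "B", "D": "A"}
gold = gold_standard(parent, observed={"B"})
```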

Predictive performance in HSLDA is evaluated by computing

\[
p\left(y_{l,\hat{d}} \mid w_{1:N_{\hat{d}},\hat{d}},\; w_{1:N_d,1:D},\; y_{l \in \mathcal{L},1:D}\right)
\]

for each test document $\hat{d}$ for each observed label $y_{l,\hat{d}}$ (given the test document words). For efficiency,
the expectation of this probability distribution was approximated in the following way: Expectations
of $\bar{z}_{\hat{d}}$ and $\eta_l$ were estimated with samples from the posterior. Fixing these expectations,
we performed Gibbs sampling over the hierarchy to acquire predictive samples for the documents 
in the test set. The true positive rate was calculated as the average expected labeling for gold stan- 
dard positive labels. The false positive rate was calculated as the average expected labeling for gold 
standard negative labels. 
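
As a concrete reading of these two definitions, the following Python sketch averages the expected labeling over gold standard positives and negatives. The dictionary layout, the function name, and the toy values are hypothetical; entries whose gold value is `None` are skipped, mirroring labels that are ignored in evaluation.

```python
def expected_rates(expected_label, gold):
    """Average expected labeling over gold-positive and gold-negative labels.

    expected_label: dict (doc, label) -> predictive probability that the label applies
    gold:           dict (doc, label) -> 1, 0, or None (None entries are skipped)
    """
    pos = [expected_label[key] for key, g in gold.items() if g == 1]
    neg = [expected_label[key] for key, g in gold.items() if g == 0]
    tpr = sum(pos) / len(pos)
    fpr = sum(neg) / len(neg)
    return tpr, fpr


expected = {("d1", "B"): 0.9, ("d1", "C"): 0.2, ("d1", "A"): 0.8,
            ("d2", "B"): 0.5, ("d2", "C"): 0.4}
gold = {("d1", "B"): 1, ("d1", "C"): 0, ("d1", "A"): None,
        ("d2", "B"): 1, ("d2", "C"): 0}
tpr, fpr = expected_rates(expected, gold)
```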

As sensitivity and specificity can always be traded off, we examined sensitivity for a range of 
values for two different parameters — the prior means for the regression coefficients and the thresh- 
old for the auxiliary variables. The goal in this analysis was to evaluate the performance of these 
models subject to more or less stringent requirements for predicting positive labels. These two pa- 
rameters have important related functions in the model. The prior mean in combination with the 
auxiliary variable threshold together encode the strength of the prior belief that unobserved labels 
are likely to be negative. Effectively, the prior mean applies negative pressure to the predictions 
and the auxiliary variable threshold determines the cutoff. For each model type, separate models 
were fit for each value of the prior mean of the regression coefficients. This is a proper Bayesian 
sensitivity analysis. In contrast, to evaluate predictive performance as a function of the auxiliary 
variable threshold, a single model was fit for each model type and prediction was evaluated based 
on predictive samples drawn subject to different auxiliary variable thresholds. These methods are 
significantly different since the prior mean is varied prior to inference, and the auxiliary variable 
threshold is varied following inference. 
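
The second kind of analysis, varying the auxiliary variable threshold after inference, can be sketched as below. The sketch assumes that predictive samples of the auxiliary variables are already available for each (document, label) pair and that a label is predicted positive when its sampled auxiliary variable exceeds the threshold; the data layout and names are hypothetical.

```python
def roc_points(aux_samples, gold, thresholds):
    """Return (threshold, TPR, FPR) triples from predictive auxiliary-variable samples.

    aux_samples: dict key -> list of sampled auxiliary values for that (doc, label)
    gold:        dict key -> 1 or 0
    """
    points = []
    for t in thresholds:
        # Expected labeling: fraction of samples that exceed the threshold.
        expected = {k: sum(a > t for a in s) / len(s) for k, s in aux_samples.items()}
        pos = [expected[k] for k, g in gold.items() if g == 1]
        neg = [expected[k] for k, g in gold.items() if g == 0]
        points.append((t, sum(pos) / len(pos), sum(neg) / len(neg)))
    return points


aux = {"p": [1.0, 0.5, -0.2, 2.0], "n": [-1.0, 0.3, -0.5, -2.0]}
points = roc_points(aux, gold={"p": 1, "n": 0}, thresholds=[-3, 0, 3])
```

Only the thresholding is repeated here; no model is refit, in contrast to the sweep over the prior mean, which requires one fitted model per value.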

Figure 15.4(a) demonstrates the performance of the model on the clinical data as an ROC curve
varying $\mu$. For instance, a hyperparameter setting of $\mu = -1.6$ yields the following performance:
the full HSLDA model had a true positive rate of 0.57 and a false positive rate of 0.13, the SLDA 
model had a true positive rate of 0.42 and a false positive rate of 0.07, and the HSLDA model where 
LDA and the regressions were fit separately had a true positive rate of 0.39 and a false positive rate 
of 0.08. These points are highlighted in Figure 15.4(a). Note that the figure is somewhat misleading 
because for any one value of $\mu$, HSLDA outperforms the comparison models by a relatively large
margin. 
These results indicate that the full HSLDA model predicts more of the correct labels at a cost of
an increase in the number of false positives relative to the comparison models. However, as shown
in Figure 15.4(a), HSLDA performs no worse than the comparison models across the full range
of specificities.




FIGURE 15.4

ROC curves for HSLDA out-of-sample label prediction varying $\mu$, the prior mean of the regression
parameters: (a) clinical data performance; (b) retail product performance. In both figures, solid is
HSLDA, dashed are independent regressors + SLDA (hierarchical constraints on labels ignored), and
dotted is HSLDA fit by running LDA first then running tree-conditional regressions.


Example topics (as word lists) learned for the discharge data are given below. These word lists 
are computed by sorting terms in decreasing order based on their probability under a given topic. 


Topic 1            Topic 2
MASS               WOUND
CANCER             FOOT
RIGHT              CELLULITIS
BREAST             ULCER
CHEMOTHERAPY       LEFT
METASTATIC         ERYTHEMA
LEFT               PAIN
LYMPH              SWELLING
TUMOR              SKIN
BIOPSY             RIGHT
CARCINOMA          ABSCESS
LUNG               LEG
CHEMO              OSTEOMYELITIS
ADENOCARCINOMA     TOE
NODE               DRAINAGE


These topics closely correspond to common clinical concepts, namely cancers of the thorax and 
wounds common to diabetics suffering from poor peripheral circulation. Evaluations of the subject 
coherence of these topics relative to baselines are ongoing, but early results suggest positive findings
similar to those reported for other supervised LDA models. 
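
Word lists of this kind are obtained by sorting the vocabulary by probability under a topic. A minimal Python sketch, with a hypothetical four-word vocabulary and made-up probabilities, is:

```python
def top_words(topic_word_probs, vocabulary, n=15):
    """Return the n most probable words of one topic, most probable first."""
    order = sorted(range(len(vocabulary)), key=lambda v: -topic_word_probs[v])
    return [vocabulary[v] for v in order[:n]]


vocabulary = ["wound", "mass", "foot", "cancer"]
topic = [0.05, 0.50, 0.10, 0.35]   # toy probabilities, not fitted values
words = top_words(topic, vocabulary, n=2)
```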

Figure 15.4(b) demonstrates the performance of the model on the retail product data as an ROC
curve also varying $\mu$. For instance, a hyperparameter setting of $\mu = -2.2$ yields the following
performance: the full HSLDA model had a true positive rate of 0.85 and a false positive rate of 0.30, 
the SLDA model had a true positive rate of 0.78 and a false positive rate of 0.14, and the HSLDA 
model where LDA and the regressions were fit separately had a true positive rate of 0.77 and a false 
positive rate of 0.16. These results follow a similar pattern to the clinical data. These points are 
highlighted in Figure 15.4(b). 

Example topics (as word lists) learned for the Amazon.com data are given below. These word 
lists were also computed by sorting terms in decreasing order based on their probability under a 
given topic. 


Topic 1            Topic 2
SERIES             BASEBALL
EPISODES           TEAM
SHOW               GAME
SEASON             PLAYERS
EPISODE            BASKETBALL
FIRST              SPORT
TELEVISION         SPORTS
SET                NEW
TIME               PLAYER
TWO                SEASON
SECOND             LEAGUE
ONE                FOOTBALL
CHARACTERS         STARS
DISC               FANS
GUEST              FIELD


Figure 15.5 shows the predictive performance of HSLDA relative to the two comparison mod- 
els on the clinical dataset as a function of the auxiliary variable threshold. For low values of the 
auxiliary variable threshold, the models predict labels in a more sensitive and less specific manner, 
creating the points in the upper right corner of the ROC curve. As the auxiliary variable threshold is 
increased, the models predict in a less sensitive and more specific manner, creating the points in the 
lower left hand corner of the ROC curve. HSLDA with full joint inference outperforms SLDA with 
independent regressors as well as HSLDA with separately trained regression. 


15.5 Related Work 

HSLDA does not, of course, stand alone. Models for structured labeling of bag-of-words data can 
be designed in a number of different ways. 

As shown in Section 15.4, SLDA can be used to solve this kind of problem directly; however,
doing so requires ignoring the hierarchical dependencies amongst the labels. Other models
that incorporate LDA and supervision that could also be used to solve this problem include
LabeledLDA (Ramage et al., 2009) and DiscLDA (Lacoste-Julien et al.). Various applications of these
models to computer vision and document networks have been explored (Wang et al., 2009; Chang
and Blei, 2010). None of these models, however, leverage dependency structure in the label space.

FIGURE 15.5

ROC curve for out-of-sample ICD-9 code prediction varying auxiliary variable threshold. $\mu = 1.0$
for all three models in this figure.

In other non-LDA-based related work, researchers have classified documents into hierarchies (a 
closely related task) using naive Bayes classifiers and support vector machines. Most of this work 
has been demonstrated on relatively small datasets and small label spaces, and has focused on single 
label classification without a model of documents such as LDA (McCallum et al., 1999; Dumais and
Chen, 2000; Koller and Sahami, 1997; Chakrabarti et al., 1998).


15.6 Discussion 

The SLDA model family, of which HSLDA is a member, can be understood in two different ways. 
One way is to see it as a family of topic models that improve on the topic modeling performance of 
LDA via the inclusion of observed supervision. An alternative, complementary way is to see it as 
a set of models that can predict labels for bag-of-words data. A large diversity of problems can be
expressed as label prediction problems for bag-of-words data. A surprisingly large amount of data
possesses structured labels, either hierarchically constrained or otherwise. HSLDA directly addresses
this kind of data and works well in practice. That it outperforms more straightforward approaches 
should be of interest to practitioners. 

There are many kinds of problems that have the same characteristics as this: any data that con- 
sists of free text that has been partially or completely categorized by human editors; more specifi- 
cally, any bag-of-words data that has been, at least in part, categorized. Examples include, but are 
not limited to, webpages and curated hierarchical directories of the same (DMO, 2002), product
descriptions and catalogs (e.g., AMA (2011) as available from SNA (2004)), and patient records and
diagnosis codes assigned to them for bookkeeping and insurance purposes. The model we cover in
this chapter shows one way to combine these two sources of information into a single model al- 
lowing one to categorize new text documents automatically, suggest labels that might be inaccurate, 
compute improved similarities between documents for information retrieval purposes, and more. 

Extensions to this work include a nonparametric Bayesian extension with unbounded topic 
cardinality and relaxations to different kinds of label structure. Unbounded topic cardinality variants
pose interesting inference challenges. Imposing different kinds of label structure constraints is
possible within this framework but requires relaxing some of the assumptions we made in deriving 
the sampling distributions for HSLDA inference. 


Appendix 

Probit Regression 

For reasons that are somewhat obscure, statisticians tend to use probit regression for binary classification
whereas machine learners tend to use logistic regression. The “probit” function is the inverse
of the normal cumulative distribution function (cdf). We denote the normal cdf function $\Phi(x; \mu, \sigma^2)$
with $\mu$ the mean, $\sigma^2$ the variance, and $x$ the argument.

The range of the normal cdf is (0, 1), which means that it can be interpreted as a probability. For 
instance, one can construct a generalized linear classification model (a “probit regression model”) 
of the form 

\[
P(y_i = 1) = \Phi(x_i^T \beta; 0, \sigma^2). \tag{15.7}
\]

Depending on convention (i.e., binary $y_i$ represented as $\{1, 0\}$ or $\{1, -1\}$), the probability of $y_i$
being labeled the opposite way is $P(y_i = -1)$ or $P(y_i = 0) = 1 - P(y_i = 1)$. Here $x_i$ is
a vector of covariates, $\beta$ is a vector of weights, and $y_i$ is a single, binary valued response. The
close relationship between regression and classification is in full display here: probit regression is a 
“generalized linear regression model” as well as a “binary classifier.” 
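
As a small numerical illustration of (15.7), the following Python sketch evaluates the probit probability with the standard library's `statistics.NormalDist` and then applies the probit function (the inverse cdf) to recover the linear score. The covariates and weights are made-up values and the function name is hypothetical.

```python
from statistics import NormalDist

def probit_probability(x, beta, sigma=1.0):
    """P(y = 1 | x, beta) = Phi(x^T beta; 0, sigma^2), as in (15.7)."""
    score = sum(xi * bi for xi, bi in zip(x, beta))
    return NormalDist(mu=0.0, sigma=sigma).cdf(score)


p = probit_probability(x=[1.0, 2.0], beta=[0.5, -0.1])  # linear score is 0.3
score_back = NormalDist().inv_cdf(p)                     # probit of p
```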

In this model we would like to use labeled training data, $\{x_i, y_i\}_{i=1}^N$, to “learn” the value of $\beta$ and
then to use this value to predict the value of $y_{N+1} \mid x_{N+1}, \beta$. Being Bayesian about inference means
that we will average over the posterior distribution of $\beta$ when making predictions. This means that
we want to draw samples from the posterior distribution of $\beta \mid \{x_i, y_i\}_{i=1}^N$. To do this efficiently one
can introduce a set of auxiliary variables $\{u_i\}_{i=1}^N$.

By auxiliary variables we mean that such variables will be used as an intermediary for purposes 
of efficiency but will otherwise be uninteresting. They are variables introduced into a model in order 
to make inference easier but whose existence does not change the distribution of interest. Auxiliary 
variables for slice sampling are one particularly clever use of auxiliary variables. The auxiliary 
variable trick in probit regression is another. 

For the purposes of exposition, forget about the $i$ index and focus on a single instance $y$, $x$, and
$u$. The argument we make will hold for all $i$ by simply reintroducing subscripts.

To start, let’s propose a factorized joint distribution for these quantities

\[
P(y, x, u) = P(y \mid u)\, P(u \mid x, \beta). \tag{15.8}
\]

Straight away, one can see why this auxiliary variable scheme works. By the law of total probability
we have

\[
P(y, x) = \int P(y, x, u)\, du = \int P(y \mid u)\, P(u \mid x, \beta)\, du. \tag{15.9}
\]

So, if by some means we generate $S$ samples $\{u^{(s)}, y^{(s)}, x^{(s)}\}_{s=1}^S \sim P(y, x, u)$, we know that
marginalizing $u$ out (i.e., disregarding its value) we get samples $\{y^{(s)}, x^{(s)}\}_{s=1}^S \sim P(y, x)$.

We haven’t specified the most important part of the auxiliary variable sampling scheme yet,
namely, what $P(y \mid u)$ and $P(u \mid x, \beta)$ are. Let us try $y = \mathrm{sign}(u)$ and $u \sim N(x^T\beta, \sigma^2)$. These
choices are nice in a particular way. First let us verify that the marginalization of $u$ out of this model
results in the model specification in Equation (15.7):

\begin{align*}
P(y = 1 \mid x, \beta) &= \int P(y = 1 \mid u)\, P(u \mid x, \beta)\, du \\
&= \int I(u > 0)\, N(u; x^T\beta, \sigma^2)\, du \\
&= \int_0^\infty N(u; x^T\beta, \sigma^2)\, du \\
&= 1 - \Phi(0; x^T\beta, \sigma^2) \\
&= \Phi(x^T\beta; 0, \sigma^2),
\end{align*}

where the last line comes from the fact that for symmetric distributions like the normal distribution,
$\Phi(x^T\beta; 0, \sigma^2) = 1 - \Phi(-x^T\beta; 0, \sigma^2)$, and the mean of a normal cdf can be translated arbitrarily, i.e.,
$\Phi(-x^T\beta; 0, \sigma^2) = \Phi(0; x^T\beta, \sigma^2)$ (which comes from adding the offset $x^T\beta$ to the cdf argument
and mean).
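
The marginalization argument can also be checked numerically. The Python sketch below simulates the auxiliary variable, sets the label to its sign, and compares the resulting frequency of positive labels with the two closed forms from the last two lines of the derivation. The linear score, the standard deviation, and the sample size are arbitrary illustrative values.

```python
import random
from statistics import NormalDist

def marginal_check(score, sigma, n_samples, rng):
    """Compare a Monte Carlo estimate of P(y = 1) with the two closed forms."""
    # Simulate u ~ N(score, sigma^2) and set y = sign(u).
    hits = sum(rng.gauss(score, sigma) > 0 for _ in range(n_samples))
    monte_carlo = hits / n_samples
    shifted = 1.0 - NormalDist(mu=score, sigma=sigma).cdf(0.0)  # 1 - Phi(0; x^T beta, sigma^2)
    centered = NormalDist(mu=0.0, sigma=sigma).cdf(score)       # Phi(x^T beta; 0, sigma^2)
    return monte_carlo, shifted, centered


mc, shifted, centered = marginal_check(0.4, 1.5, 200_000, random.Random(0))
```

The two closed forms agree up to floating-point error, and the simulated frequency approaches them as the number of samples grows.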

Having established the fact that for a particular sort of auxiliary variable choice, we get the same 
probit model as we wanted, why is this choice nice? 

Well, it comes down to sampling $\beta$, $u$, and $y$. Generally, sampling $\beta$ in the model without auxiliary
variables will require hybrid Monte Carlo (HMC) or Metropolis-Hastings of some sort. Gibbs
sampling often comes with substantial benefits. By making this choice of auxiliary variable, the
conditional distribution of $u_i$ given everything else is proportional to a truncated normal distribution,
a distribution that is, by nature of its commonness, relatively straightforward to sample from.
The big benefit, though, comes from the posterior form for sampling $\beta$. With the $u$’s “observed” (as
they would be in a Gibbs sampler), the posterior distribution of $\beta$ (for typical choices of prior) is
precisely the same as that for linear regression, perhaps the most well-studied model in statistics. In
that case, sampling $\beta$ from its posterior distribution is usually quite simple, and certainly more so
than sampling $\beta$ without the $u$ auxiliary variables.
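
To make the alternation concrete, here is a minimal Python sketch of the resulting Gibbs sampler for the simplest case of a single covariate and a scalar weight. It assumes unit noise variance, labels coded as +1 and -1, a zero-mean normal prior on the weight with a hypothetical variance of 10, and inverse-cdf sampling of the truncated normal; the simulated data and all names are illustrative only.

```python
import random
from statistics import NormalDist

STD = NormalDist()

def draw_u(mean, y, rng):
    """u ~ N(mean, 1) truncated to u > 0 if y == 1, else to u < 0."""
    p_neg = STD.cdf(-mean)
    lo, hi = (p_neg, 1.0) if y == 1 else (0.0, p_neg)
    p = min(max(lo + (hi - lo) * rng.random(), 1e-12), 1.0 - 1e-12)
    return mean + STD.inv_cdf(p)

def gibbs_probit(xs, ys, n_iter, rng, prior_var=10.0):
    """Gibbs sampler for y_i = sign(u_i), u_i ~ N(x_i * beta, 1), beta ~ N(0, prior_var)."""
    beta, draws = 0.0, []
    # With the u's "observed", beta has a normal linear-regression posterior.
    post_var = 1.0 / (1.0 / prior_var + sum(x * x for x in xs))
    for _ in range(n_iter):
        us = [draw_u(x * beta, y, rng) for x, y in zip(xs, ys)]  # u | beta, y
        post_mean = post_var * sum(x * u for x, u in zip(xs, us))
        beta = rng.gauss(post_mean, post_var ** 0.5)             # beta | u
        draws.append(beta)
    return draws


rng = random.Random(1)
true_beta = 1.5
xs = [rng.uniform(-2, 2) for _ in range(300)]
ys = [1 if rng.gauss(x * true_beta, 1.0) > 0 else -1 for x in xs]
draws = gibbs_probit(xs, ys, n_iter=600, rng=rng)
posterior_mean = sum(draws[100:]) / len(draws[100:])  # discard 100 burn-in draws
```

Both conditional draws are from standard distributions, so no proposal tuning is needed, which is the practical benefit described above.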

The extension to the multivariate HSLDA setting is straightforward and follows this line of 
reasoning precisely. An extended discussion of the techniques suggested here and the multivariate 
generalization can be found in Gelman et al. (2004). 
Although mixed membership models have achieved great success in unsupervised learning, they
have not been applied as widely to classification problems. In this chapter, we discuss a family 
of discriminative mixed membership (DMM) models. By combining unsupervised mixed member- 
ship models with multi-class logistic regression, DMM models can be used for classification. In 
particular, we discuss discriminative latent Dirichlet allocation (DLDA) for text classification and 
discriminative mixed membership naive Bayes (DMNB) for classification on general feature vectors.
Two variational inference algorithms are considered for learning the models, including a fast
inference algorithm which uses fewer variational parameters and is substantially more efficient than 
the standard mean field variational approximation. The efficacy of the models is demonstrated by 
extensive experiments on multiple datasets. 
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16.1 Introduction 

In recent years, mixed membership (MM) models have found wide application in a variety of do- 
mains, such as topic modeling (Blei et al., 2003), bioinformatics (Airoldi et al., 2008), and social 
network analysis (Koutsourelakis and Eliassi-Rad, 2008). A key advantage of such models is that 
they provide a succinct and interpretable representation of otherwise large and high-dimensional 
datasets. However, one important restriction of most existing mixed membership models is that they 
are unsupervised models and cannot leverage class label information for classification. On the other 
hand, most of the popular classifiers, such as support vector machines (SVM) (Burges, 1998) and 
logistic regression (LR) (Pampel, 2000), are usually difficult to interpret. Therefore, an accurate dis- 
criminative classification model leveraging mixed membership models for interpretability is highly 
desirable. 

This chapter discusses discriminative¹ mixed membership (DMM) models as a classification
algorithm by combining multi-class logistic regression with unsupervised mixed membership models.
In particular, two variants are considered in this chapter — discriminative latent Dirichlet allocation 
(DLDA) and discriminative mixed membership naive Bayes (DMNB). DLDA is applicable to text 
classification and uses LDA as the underlying mixed membership model (Blei et al., 2003). DMNB 
is applicable to non-text classification involving different types (e.g., numerical, categorical) of fea- 
ture vectors and uses mixed membership naive Bayes (MNB) as the underlying mixed membership 
model (Shan and Banerjee, 2010). 

Two variational inference algorithms, together with the corresponding variational EM algorithms, are
used to learn the model. The first inference algorithm is based on the ideas originally proposed in
the context of LDA (Blei et al., 2003). The second algorithm uses a substantially smaller number of
variational parameters, with no dependency on the dimensionality of the dataset. By design, the new
algorithm has substantially smaller memory requirements and is orders of magnitude faster. The
speedup grows roughly with the dimensionality of the data: the higher the dimension, the larger the
computational gain.

The effectiveness of DMM models is established through extensive experiments on text data for
DLDA and on UCI data for DMNB. The results show that DMM models achieve higher or competitive
performance compared to state-of-the-art classification algorithms. More importantly, the new
variational inference algorithm used in DMM is not only faster than the one used in Blei et al. (2003),
but also leads to higher classification accuracy.

The rest of this chapter is organized as follows: Section 16.2 briefly reviews the related work. 
Sections 16.3 and 16.4 discuss DLDA and DMNB, respectively. Regular and fast variational inference
algorithms are introduced in Sections 16.5 and 16.6, respectively. We present the experimental results for
DMM models in Section 16.7 and conclude in Section 16.8. 


16.2 Related Work 

This section gives a brief overview of two unsupervised mixed membership models, latent Dirichlet
allocation and mixed membership naive Bayes, and then discusses the related work on
incorporating supervised information into mixed membership models.


¹ "Discriminative" here does not mean a discriminative model, but a generative model used for
classification instead of clustering.

Naive Bayes (NB) models, or mixture models (Redner and Walker, 1984; Banerjee et al., 2005b) in
general, assume that for each data point $x$, the latent component $z$ is fixed across all features. While
such an assumption is reasonable in certain domains, it puts a major restriction on the flexibility
of naive Bayes models. Latent Dirichlet allocation (LDA) (Blei et al., 2003; Griffiths and Steyvers,
2004) is an elegant extension of standard mixture models that relaxes this assumption in the context
of topic modeling, where each data point is a collection of tokens, e.g., a document with a collection
of words. LDA assumes that each word in a document potentially comes from a separate topic $z$.
If there are $k$ topics in total, $z$ takes a value from 1 to $k$ and is generated from a discrete
distribution discrete($\pi$) of this document, and all documents share a $k$-dimensional Dirichlet prior
$\alpha$. The generative process for each document $x$ is as follows:

1. Choose a mixed membership vector $\pi \sim \text{Dirichlet}(\alpha)$.

2. For each of $m$ words $\{x_j, [j]_1^m\}$ ($[j]_1^m$ is defined as $\{j = 1, 2, \ldots, m\}$) in $x$:

(a) Choose a topic (component) $z_j = c \sim \text{discrete}(\pi)$.

(b) Choose $x_j$ from $p(x_j|\beta_c)$.

$\beta = \{\beta_c, [c]_1^k\}$ is a collection of parameters for $k$ component distributions, where each
$\beta_c$ is a $V$-dimensional discrete distribution given $V$, the total number of words in the dictionary.
The density function of a document $x$ is


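$$p(x|\alpha,\beta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{j=1}^{m} \sum_{c=1}^{k} p(z_j = c|\pi)\, p(x_j|\beta_c) \right) d\pi . \tag{16.1}$$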


Computing the probability of a collection of documents is intractable, and several approximate 
inference techniques have been proposed to address the problem. The two most popular approaches 
include variational approximation (Blei et al., 2003) and Gibbs sampling (Geman and Geman, 1984; 
Griffiths and Steyvers, 2004). 
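
The generative process above can be illustrated with a short simulation. This is a minimal sketch using only the Python standard library. The function names (`sample_dirichlet`, `sample_discrete`, `generate_document`) and the numerical values of $\alpha$ and $\beta$ are illustrative assumptions, not part of the chapter.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw pi ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_c, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(alpha, beta, m, rng):
    """Steps 1-2 of the LDA process: draw pi, then a topic and a word for each position."""
    pi = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(m):
        c = sample_discrete(pi, rng)                  # step 2(a): z_j = c ~ discrete(pi)
        topics.append(c)
        words.append(sample_discrete(beta[c], rng))   # step 2(b): x_j ~ discrete(beta_c)
    return pi, topics, words

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]            # k = 3 topics (illustrative values)
beta = [[0.7, 0.1, 0.1, 0.1],      # each row is a distribution over V = 4 words
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4]]
pi, topics, words = generate_document(alpha, beta, m=10, rng=rng)
print(pi, topics, words)
```

Because each word draws its own topic, the words of one document can come from several rows of `beta`, which is exactly the relaxation of the single-component assumption of naive Bayes.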


16.2.2 Mixed Membership Naive Bayes 

Although LDA achieves good performance in topic modeling, it cannot deal with data points
with numerical or real-valued features, or data points with heterogeneous features. Mixed membership
naive Bayes relaxes these limitations by introducing a separate exponential family distribution
(Barndorff-Nielsen, 1978) for each feature. It is designed to deal with sparse and heterogeneous
feature vectors. Following MNB, the generative process for the data point $x$ can be described as
follows:

1. Choose a mixed membership vector $\pi \sim \text{Dirichlet}(\alpha)$.

2. For each non-missing feature $x_j$ of $x$:

(a) Choose a component $z_j = c \sim \text{discrete}(\pi)$.

(b) Choose a feature value $x_j \sim p_{\psi_j}(x_j|\theta_{jc})$.

Here, $\psi_j$ and $\theta_{jc}$ jointly decide an exponential family distribution (Banerjee et al., 2005b;
Barndorff-Nielsen, 1978) for feature $j$ and component $c$. In particular,

$$p_{\psi_j}(x_j|\theta_{jc}) = \exp\left(x_j\theta_{jc} - \psi_j(\theta_{jc})\right) p_j(x_j),$$

where $\psi_j(\cdot)$ is the cumulant or the log-partition function, and $p_j(x_j)$ is a non-negative base measure.
$\psi_j(\cdot)$ determines the exponential family model appropriate for feature $j$, e.g., Gaussian, Poisson,
Bernoulli, etc., and $\theta_{jc}$ is the natural parameter corresponding to feature $j$ and component $c$.

The density function for x is given by: 


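$$p(x|\alpha,\Theta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{\substack{j=1 \\ \exists x_j}}^{d} \sum_{c=1}^{k} p(z_j = c|\pi)\, p_{\psi_j}(x_j|\theta_{jc}) \right) d\pi , \tag{16.2}$$

where $\exists x_j$ denotes any observed feature $j$ for $x$ and $\Theta = \{\theta_{jc}, [j]_1^d, [c]_1^k\}$.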


16.2.3 Supervised LDA 

Supervised latent Dirichlet allocation (SLDA) (Blei and McAuliffe, 2007) is an extension of LDA
that accommodates response variables as supervised information in addition to the documents.
The response variable is assumed to be generated from a normal linear model $N(w^T\bar{z}, \sigma^2)$,
where $w$ and $\sigma^2$ are the parameters and the covariates $\bar{z} = (\sum_{j=1}^{m} z_j)/m$ are the empirical average
frequencies of each latent topic for the words in the document. If there are $k$ components in total,
each $z_j$ is represented as a $k$-dimensional unit vector with only the $c$th entry being 1 if it denotes
the $c$th component. The generative process for SLDA is as follows:

1. Choose a mixed membership vector $\pi \sim \text{Dirichlet}(\alpha)$.

2. For each of $m$ words $\{x_j, [j]_1^m\}$ in $x$:

(a) Choose a topic (component) $z_j = c \sim \text{discrete}(\pi)$.

(b) Choose $x_j$ from $p(x_j|\beta_c)$.

3. Choose a response variable $y \sim N(w^T\bar{z}, \sigma^2)$.

The density function of SLDA is hence given by

$$p(x, y|\alpha,\beta,w,\sigma^2) = \int_{\pi} p(\pi|\alpha) \left( \prod_{j=1}^{m} \sum_{c=1}^{k} p(z_j = c|\pi)\, p(x_j|\beta_c) \right) p(y|z,w,\sigma^2)\, d\pi .$$


Generally, SLDA is a combination of mixed membership models with generalized linear models
to incorporate supervised information. In particular, the generalized linear model for SLDA to
generate the response variable $y$ is a univariate normal linear model. Therefore, SLDA is constrained to
deal with one-dimensional, real-valued response variables.

Besides supervised LDA, recent years have seen quite a few extensions that incorporate
supervised information into mixed membership models. Flaherty et al. (2005) proposed labeled latent
Dirichlet allocation to incorporate functional annotation of known genes to guide gene clustering.
Fei-Fei and Perona (2005) proposed a Bayesian model for natural scene categorization.
Lacoste-Julien et al. (2008) proposed DiscLDA, which determines document position on the topic simplex
with the guidance of labels. Mimno and McCallum (2008) proposed a Dirichlet-multinomial regression
which accommodates different types of metadata, including labels. Wang et al. (2008) proposed a
correlated labeling model for multilabel classification. Wang et al. (2009) extended SLDA for image
classification and annotation. Ramage et al. (2011) proposed partially labeled topic models which
make use of unsupervised topic models but align some learned topics with labels. Wang and Blei
(2011) proposed a collaborative topic regression model which uses ratings across different users
as the supervised information for scientific articles and combines topic models with collaborative
filtering.

16.3 Discriminative LDA

SLDA (Blei and McAuliffe, 2007) incorporates the response variable into LDA (Blei et al., 2003), 
but it has two limitations preventing it from being used as a classification algorithm: 

1. The response variables in SLDA are univariate real numbers assumed to be generated from a
normal linear model, whereas the response variables, i.e., labels, are discrete categories in the 
classification setting. Although Blei and McAuliffe (2007) also gave a general framework for 
other types of response variables via generalized linear models, variational inference is not as 
straightforward as in SLDA. In particular, the Taylor expansion-based approach in Blei and 
McAuliffe (2007) forgoes the lower-bound guarantee of variational inference. 

2. Like latent Dirichlet allocation, SLDA is designed for text data as a collection of homogeneous 
tokens. However, most non-text classification tasks, e.g., the UCI benchmark datasets, have 
features of heterogeneous types with measured values. SLDA is not designed for such data. 

In this and the following sections, we discuss discriminative mixed membership models, which 
combine MM models with logistic regression for classification. The underlying MM models for 
DMM include LDA and MNB, yielding discriminative LDA and discriminative MNB for text and 
numerical data, respectively. We discuss DLDA first and introduce DMNB in the next section. 

Assuming there are $t$ classes and $k$ components, the graphical model for DLDA is given in
Figure 16.1(a). It is similar to LDA (Blei et al., 2003), except that it generates the label $y$ in addition to
the document $x$ through logistic regression with parameter $\eta = \{\eta_1, \ldots, \eta_t\}$, where each $\eta_h$ for
$[h]_1^{t-1}$ is a $k$-dimensional vector and $\eta_t$ is a zero vector by default. The generative process for each
document $x$ and label $y$ is given as follows:

1. Choose a mixed membership vector $\pi \sim \text{Dirichlet}(\alpha)$.

2. For each of $m$ words $\{x_j, [j]_1^m\}$ in the document $x$,

(a) Choose a component $z_j = c \sim \text{discrete}(\pi)$, $c \in \{1, 2, \ldots, k\}$.

(b) Choose a word $x_j \sim \text{discrete}(\beta_c)$.

3. Choose the label from a multi-class logistic regression $y \sim \text{LR}(\eta_1^T\bar{z}, \eta_2^T\bar{z}, \ldots, \eta_t^T\bar{z})$.



FIGURE 16.1
Graphical models for (a) DLDA and (b) DMNB.

As in SLDA, $\bar{z}$ is an average of $z_1, \ldots, z_m$ over all observed words. $\text{LR}(\eta_1^T\bar{z}, \eta_2^T\bar{z}, \ldots, \eta_t^T\bar{z})$
denotes a logistic transformation on $[\eta_1^T\bar{z}, \eta_2^T\bar{z}, \ldots, \eta_t^T\bar{z}]$, which is equivalent to a discrete distribution
$(p_1, \ldots, p_{t-1}, 1 - \sum_{h=1}^{t-1} p_h)$ with

$$p_h = \frac{\exp(\eta_h^T\bar{z})}{1 + \sum_{h'=1}^{t-1} \exp(\eta_{h'}^T\bar{z})} \quad \text{for } [h]_1^{t-1} .$$

In two-class classification, $y$ is 0 or 1 generated from $\text{Bernoulli}\left(\frac{1}{1+\exp(-\eta_1^T\bar{z})}\right)$, i.e., there is only
one parameter $\eta_1$ to be estimated; $\eta_2$ is the zero vector by default.
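
The logistic transformation can be checked numerically. The following is a minimal sketch in plain Python. The function name `lr_probabilities` and the example values of $\eta$ and $\bar{z}$ are illustrative assumptions. It takes the $t-1$ free weight vectors and treats $\eta_t$ as the zero vector, as in the text.

```python
import math

def lr_probabilities(eta, z_bar):
    """Class probabilities from t-1 weight vectors eta_h; class t has eta_t = 0."""
    scores = [sum(w * z for w, z in zip(eta_h, z_bar)) for eta_h in eta]
    shift = max(0.0, max(scores))              # subtract a constant for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    denom = math.exp(-shift) + sum(exps)       # exp(-shift) is the shifted "1 +" term of class t
    probs = [e / denom for e in exps]
    probs.append(math.exp(-shift) / denom)     # p_t = 1 - sum_h p_h
    return probs

z_bar = [0.5, 0.3, 0.2]                        # mean topic assignment, k = 3
eta = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]     # t - 1 = 2 weight vectors, so t = 3 classes
print(lr_probabilities(eta, z_bar))
```

With a single weight vector the first returned probability reduces to $1/(1+\exp(-\eta_1^T\bar{z}))$, the Bernoulli parameter of the two-class case.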

DLDA can be considered a variant of SLDA. In SLDA, the response variable is a real
number generated from a normal linear model. In DLDA, the response variable is a classification
label generated from a generalized linear model (McCullagh and Nelder, 1989), in particular, from
the multi-class logistic regression.

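From the generative model, the density function for $(x, y)$ is given by:

$$p(x, y|\alpha,\beta,\eta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{j=1}^{m} \sum_{c=1}^{k} p(z_j = c|\pi)\, p(x_j|\beta_c) \right) p(y|z,\eta)\, d\pi . \tag{16.3}$$

The probability of the entire dataset of $n$ documents and labels $(\mathcal{X} = \{x_i, [i]_1^n\}, \mathcal{Y} = \{y_i, [i]_1^n\})$ is
given by

$$p(\mathcal{X}, \mathcal{Y}|\alpha,\beta,\eta) = \prod_{i=1}^{n} \int_{\pi_i} p(\pi_i|\alpha) \left( \prod_{j=1}^{m_i} \sum_{c=1}^{k} p(z_{ij} = c|\pi_i)\, p(x_{ij}|\beta_c) \right) p(y_i|z_i,\eta)\, d\pi_i . \tag{16.4}$$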

There are two important properties of DLDA and of discriminative mixed membership models
in general: (1) The $k$-dimensional mixed membership $\bar{z}$ effectively serves as a low-dimensional
representation of the original document. While $\bar{z}$ in LDA is inferred in an unsupervised way, it is
obtained from a supervised dimensionality reduction in DLDA. (2) DLDA allows the number of
classes $t$ and the number of components $k$ in the generative model to be different. If $k$ were forced
to be equal to $t$, for problems with a small number of classes, $\bar{z}$ would be a rather coarse
representation of the document. In particular, for two-class problems, $\bar{z}$ would lie on the 2-simplex,
which may not be an informative representation for classification purposes. Decoupling the choice
of $k$ from $t$ prevents such pathologies. In principle, one may find a proper $k$ using a nonparametric
Dirichlet process mixture model (Blei and Jordan, 2006).

In DLDA, following Blei and McAuliffe (2007), we have used $\bar{z}$ (the mean of $z$ over all words) as
an input to logistic regression. In principle, any other transformation of $z$ could work, as long as it
gives a reasonable representation of the original data point. The choice of $\bar{z}$ is due to the following:

(1) Optimality: Given a set of data points, their best representative is always the mean according to
a wide variety of divergence functions (Banerjee et al., 2005b; Banerjee, 2007). We also notice that
$\eta_h^T\bar{z} = \eta_h^T E[z] = E[\eta_h^T z]$, which means that if we take the mean of $\eta_h^T z$ on each feature as the
input to the logistic transformation function, it is equivalent to using $\eta_h^T\bar{z}$ as the input to that function.

(2) Simplicity: Since $z$ is the latent variable, if we use a complicated transformation on $z$ such as a
non-linear function, it would greatly increase the difficulty in inference and learning.


16.4 Discriminative MNB 

Discriminative MNB is similar to DLDA, but it is designed for non-text data with real-valued
features, and, like MNB, it keeps a separate distribution for each feature. Given the graphical model in
Figure 16.1(b), the generative process for the data point $x$ and label $y$ is as follows:

1. Choose a mixed membership vector $\pi \sim \text{Dirichlet}(\alpha)$.

2. For each non-missing feature $j$ in $x$:

(a) Choose a component $z_j = c \sim \text{discrete}(\pi)$, $c \in \{1, 2, \ldots, k\}$.

(b) Choose a feature value $x_j \sim p_{\psi_j}(x_j|\theta_{jc})$.

3. Choose the label from a multi-class logistic regression $y \sim \text{LR}(\eta_1^T\bar{z}, \eta_2^T\bar{z}, \ldots, \eta_t^T\bar{z})$.
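The density function for $(x, y)$ is given by

$$p(x, y|\alpha,\Theta,\eta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{\substack{j=1 \\ \exists x_j}}^{d} \sum_{c=1}^{k} p(z_j = c|\pi)\, p_{\psi_j}(x_j|\theta_{jc}) \right) p(y|z,\eta)\, d\pi . \tag{16.5}$$

The probability of the entire dataset of $n$ data points and labels $(\mathcal{X} = \{x_i, [i]_1^n\}, \mathcal{Y} = \{y_i, [i]_1^n\})$ is
given by

$$p(\mathcal{X}, \mathcal{Y}|\alpha,\Theta,\eta) = \prod_{i=1}^{n} \int_{\pi_i} p(\pi_i|\alpha) \left( \prod_{\substack{j=1 \\ \exists x_{ij}}}^{d} \sum_{c=1}^{k} p(z_{ij} = c|\pi_i)\, p_{\psi_j}(x_{ij}|\theta_{jc}) \right) p(y_i|z_i,\eta)\, d\pi_i . \tag{16.6}$$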


For a concrete exposition of DMNB models, we will focus on two specific instantiations of such
models based on univariate Gaussian and discrete distributions for each feature in each component, 
corresponding to real-valued features and discrete features, respectively. Note that although the two 
examples we give have the same family of distributions across all features, DMNB allows different 
features to have different distributions and parameters. 

1. DMNB-Gaussian: Such models have Gaussian distributions for each feature, hence they are
applicable to data with real-valued features. Given the model parameters $\alpha$ and $\{\mu, \sigma^2\} =
\{(\mu_{jc}, \sigma_{jc}^2), [j]_1^d, [c]_1^k\}$, the density function is given by:

$$p(x, y|\alpha,\mu,\sigma^2,\eta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{\substack{j=1 \\ \exists x_j}}^{d} \sum_{c=1}^{k} p(z_j = c|\pi)\, \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\left( -\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2} \right) \right) p(y|z,\eta)\, d\pi . \tag{16.7}$$

2. DMNB-Discrete: Such models have discrete distributions for each feature, hence are applicable
to data with categorical features. Assuming that feature $j$ can take $r_j$ possible values, each
feature $j$ and component $c$ then has a discrete distribution $\{p_{jc}(r), [r]_1^{r_j}\}$, where $p_{jc}(r) \ge 0$ and
$\sum_{r=1}^{r_j} p_{jc}(r) = 1$. Given the model parameters $\alpha$ and $p = \{p_{jc}(r), [r]_1^{r_j}, [j]_1^d, [c]_1^k\}$, the density
function is given by

$$p(x, y|\alpha, p, \eta) = \int_{\pi} p(\pi|\alpha) \left( \prod_{\substack{j=1 \\ \exists x_j}}^{d} \sum_{c=1}^{k} p(z_j = c|\pi)\, p_{jc}(x_j) \right) p(y|z,\eta)\, d\pi . \tag{16.8}$$

16.5 Inference and Estimation

For a given dataset $\{\mathcal{X}, \mathcal{Y}\}$, the learning task is to estimate the model parameters such that the
likelihood of observing the whole dataset is maximized. A general approach for such a task is to 
use expectation maximization (EM) algorithms. However, the likelihood calculations in (16.3) and 
(16.5) are intractable, implying that a direct application of EM is not feasible. In this section, we 
introduce a variational inference method (Wainwright and Jordan, 2008), which alternates between 
obtaining a tractable lower bound to the true log-likelihood and choosing the model parameters 
to maximize the lower bound. To obtain a tractable lower bound, we consider an entire family of 
parameterized lower bounds with a set of free variational parameters, and pick the best lower bound 
by optimizing the lower bound with respect to the free variational parameters. 

16.5.1 Variational Approximation 

In most applications of the EM algorithm for mixture modeling, in the E-step, one can directly
compute the latent variable distribution (Neal and Hinton, 1998; Banerjee et al., 2004), which is
used to calculate the expectation of the likelihood; in the M-step, parameter estimation is done by
maximizing the expectation of the complete likelihood. However, a direct computation of the latent
variable distribution $p(\pi, z|\cdot)$ is not possible for DMM models. Hence, we introduce a tractable
family of parameterized distributions $q_1(\pi, z|\gamma, \phi)$ as an approximation to $p(\pi, z|\cdot)$, where $(\gamma, \phi)$
are free variational parameters. In particular, following Blei et al. (2003), in DLDA

$$q_1(\pi, z|\gamma, \phi) = q_1(\pi|\gamma) \prod_{j=1}^{m} q_1(z_j|\phi_j) , \tag{16.9}$$

and in DMNB

$$q_1(\pi, z|\gamma, \phi) = q_1(\pi|\gamma) \prod_{\substack{j=1 \\ \exists x_j}}^{d} q_1(z_j|\phi_j) . \tag{16.10}$$

The plate diagram for $q_1$ is in Figure 16.2(a). For each data point, $\gamma$ is a $k$-dimensional Dirichlet
distribution parameter over $\pi$ in both (16.9) and (16.10). $\phi = \{\phi_j, [j]_1^m\}$ in (16.9) are parameters
for discrete distributions over the topics $z$ of all $m$ words, and $\phi = \{\phi_j, [j]_1^d, \exists x_j\}$ in (16.10) are
parameters for discrete distributions over the latent components $z$ for all $m$ non-missing features
out of $d$ features in total.

Denoting the log-likelihood function with $\log p(x, y|\alpha, \Lambda, \eta)$, where $\Lambda = \beta$ for DLDA and
$\Lambda = \Theta$ for DMNB, applying Jensen's inequality (Blei et al., 2003) yields:

$$\log p(x, y|\alpha, \Lambda, \eta) \ge E_{q_1}[\log p(\pi, z, x, y|\alpha, \Lambda, \eta)] + H(q_1(\pi, z|\gamma, \phi)) . \tag{16.11}$$

Therefore, (16.11) gives a lower bound to $\log p(x, y|\alpha, \Lambda, \eta)$. For each $\{x_i, y_i\}$, denoting the lower
bound with $L(\gamma_i, \phi_i; \alpha, \Lambda, \eta)$, it can be expanded as

$$\begin{aligned}
L(\gamma_i, \phi_i; \alpha, \Lambda, \eta) = {} & E_{q_1}[\log p(\pi_i|\alpha)] + E_{q_1}[\log p(z_i|\pi_i)] + E_{q_1}[\log p(x_i|z_i, \Lambda)] \\
& - E_{q_1}[\log q_1(\pi_i|\gamma_i)] - E_{q_1}[\log q_1(z_i|\phi_i)] + E_{q_1}[\log p(y_i|z_i, \eta)] .
\end{aligned} \tag{16.12}$$

FIGURE 16.2
Variational distributions for regular and fast variational inference: (a) $q_1$ for regular variational
inference; (b) $q_2$ for fast variational inference.

It is easy to expand the first five terms in (16.12). For the last term, $E_{q_1}[\log p(y_i|z_i, \eta)]$, there
is no closed-form solution, but it can be lower-bounded after introducing a new parameter $\xi$. In
particular, we have


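$$\begin{aligned}
E_{q_1}[\log p(y_i|z_i, \eta)] &= E_{q_1}\left[ \sum_{h=1}^{t-1} \eta_h^T\bar{z}_i\, y_{ih} - \log\left(1 + \sum_{h=1}^{t-1} \exp(\eta_h^T\bar{z}_i)\right) \right] \\
&= \sum_{h=1}^{t-1} \sum_{c=1}^{k} \eta_{hc}\, E_{q_1}[\bar{z}_{ic}]\, y_{ih} - E_{q_1}\left[ \log\left(1 + \sum_{h=1}^{t-1} \exp(\eta_h^T\bar{z}_i)\right) \right] .
\end{aligned} \tag{16.13}$$

The second term of (16.13) could be expanded as follows:

$$\begin{aligned}
- E_{q_1}\left[ \log\left(1 + \sum_{h=1}^{t-1} \exp(\eta_h^T\bar{z}_i)\right) \right]
&\ge - \log\left(1 + \sum_{h=1}^{t-1} E_{q_1}\left[\exp\left(\sum_{c=1}^{k} \eta_{hc}\bar{z}_{ic}\right)\right]\right) \\
&\ge - \log\left(1 + \sum_{h=1}^{t-1} \sum_{c=1}^{k} E_{q_1}\left[\bar{z}_{ic} \exp(\eta_{hc})\right]\right) \\
&= - \log\left(1 + \sum_{h=1}^{t-1} \sum_{c=1}^{k} E_{q_1}[\bar{z}_{ic}] \exp(\eta_{hc})\right) \\
&\ge - \frac{1}{\xi_i} \sum_{h=1}^{t-1} \sum_{c=1}^{k} E_{q_1}[\bar{z}_{ic}] \exp(\eta_{hc}) + 1 - \frac{1}{\xi_i} - \log(\xi_i) ,
\end{aligned} \tag{16.14}$$

where the first inequality is from Jensen's inequality, the second inequality is also from Jensen's
inequality, noticing that $\bar{z}_i$ is actually a discrete distribution, and the third inequality is from
$-\log(x) \ge 1 - \frac{x}{\xi} - \log(\xi)$ (Minka, 2003a), by introducing a new variational parameter $\xi > 0$.
Given (16.14), the last term of (16.12) is lower-bounded by

$$E_{q_1}[\log p(y_i|z_i, \eta)] \ge \sum_{c=1}^{k} E_{q_1}[\bar{z}_{ic}] \sum_{h=1}^{t-1} \left( \eta_{hc}\, y_{ih} - \frac{1}{\xi_i} \exp(\eta_{hc}) \right) + 1 - \frac{1}{\xi_i} - \log(\xi_i) , \tag{16.15}$$

where $E_{q_1}[\bar{z}_{ic}] = \frac{1}{m_i} \sum_{j=1}^{m_i} \phi_{ijc}$ in DLDA, and $E_{q_1}[\bar{z}_{ic}] = \frac{1}{m_i} \sum_{j=1, \exists x_{ij}}^{d} \phi_{ijc}$
in DMNB.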
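
The third inequality in (16.14) is easy to verify numerically. The short sketch below, in plain Python with the illustrative helper name `log_bound`, evaluates both sides of $-\log(x) \ge 1 - x/\xi - \log(\xi)$ for a fixed $x$ and several values of $\xi$. The gap equals $x/\xi - 1 - \log(x/\xi)$, which is non-negative and vanishes only at $\xi = x$.

```python
import math

def log_bound(x, xi):
    """Right-hand side of -log(x) >= 1 - x/xi - log(xi), valid for x > 0 and xi > 0."""
    return 1.0 - x / xi - math.log(xi)

x = 3.0  # plays the role of 1 + sum_h sum_c E[z_bar_ic] * exp(eta_hc)
for xi in (0.5, 1.0, 3.0, 10.0):
    # The bound never exceeds -log(x) and touches it when xi equals x.
    print(xi, -math.log(x), log_bound(x, xi))
```

Since the bound is tight at $\xi = x$, the best choice of $\xi_i$ is one plus the double sum inside the logarithm.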


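Therefore, for DLDA, the six terms in (16.12) are given as follows:

$$E_{q_1}[\log p(\pi_i|\alpha)] = \log\Gamma\left(\sum_{c=1}^{k} \alpha_c\right) - \sum_{c=1}^{k} \log\Gamma(\alpha_c) + \sum_{c=1}^{k} (\alpha_c - 1) \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.16}$$

$$E_{q_1}[\log p(z_i|\pi_i)] = \sum_{j=1}^{m_i} \sum_{c=1}^{k} \phi_{ijc} \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.17}$$

$$E_{q_1}[\log p(x_i|\beta, z_i)] = \sum_{j=1}^{m_i} \sum_{c=1}^{k} \sum_{v=1}^{V} \phi_{ijc}\, x_{ij}^{v} \log\beta_{cv} , \tag{16.18}$$

$$E_{q_1}[\log q_1(\pi_i|\gamma_i)] = \log\Gamma\left(\sum_{c=1}^{k} \gamma_{ic}\right) - \sum_{c=1}^{k} \log\Gamma(\gamma_{ic}) + \sum_{c=1}^{k} (\gamma_{ic} - 1) \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.19}$$

$$E_{q_1}[\log q_1(z_i|\phi_i)] = \sum_{j=1}^{m_i} \sum_{c=1}^{k} \phi_{ijc} \log\phi_{ijc} , \tag{16.20}$$

$$E_{q_1}[\log p(y_i|z_i, \eta)] \ge \frac{1}{m_i} \sum_{c=1}^{k} \sum_{j=1}^{m_i} \phi_{ijc} \sum_{h=1}^{t-1} \left( \eta_{hc}\, y_{ih} - \frac{1}{\xi_i} \exp(\eta_{hc}) \right) + 1 - \frac{1}{\xi_i} - \log(\xi_i) . \tag{16.21}$$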

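For DMNB, the six terms in (16.12) are given as follows:

$$E_{q_1}[\log p(\pi_i|\alpha)] = \log\Gamma\left(\sum_{c=1}^{k} \alpha_c\right) - \sum_{c=1}^{k} \log\Gamma(\alpha_c) + \sum_{c=1}^{k} (\alpha_c - 1) \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.22}$$

$$E_{q_1}[\log p(z_i|\pi_i)] = \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \sum_{c=1}^{k} \phi_{ijc} \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.23}$$

$$E_{q_1}[\log p(x_i|z_i, \Theta)] = \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \sum_{c=1}^{k} \phi_{ijc} \log p_{\psi_j}(x_{ij}|\theta_{jc}) , \tag{16.24}$$

$$E_{q_1}[\log q_1(\pi_i|\gamma_i)] = \log\Gamma\left(\sum_{c=1}^{k} \gamma_{ic}\right) - \sum_{c=1}^{k} \log\Gamma(\gamma_{ic}) + \sum_{c=1}^{k} (\gamma_{ic} - 1) \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \tag{16.25}$$

$$E_{q_1}[\log q_1(z_i|\phi_i)] = \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \sum_{c=1}^{k} \phi_{ijc} \log\phi_{ijc} , \tag{16.26}$$

$$E_{q_1}[\log p(y_i|z_i, \eta)] \ge \frac{1}{m_i} \sum_{c=1}^{k} \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \phi_{ijc} \sum_{h=1}^{t-1} \left( \eta_{hc}\, y_{ih} - \frac{1}{\xi_i} \exp(\eta_{hc}) \right) + 1 - \frac{1}{\xi_i} - \log(\xi_i) . \tag{16.27}$$

After introducing $\xi$, the lower bound of the log-likelihood for each data point $(x_i, y_i)$ can be
represented as $L(\gamma_i, \phi_i, \xi_i; \alpha, \Lambda, \eta)$. The lower bound of the log-likelihood on the whole dataset
$\{\mathcal{X}, \mathcal{Y}\}$ is simply the summation of $L(\gamma_i, \phi_i, \xi_i; \alpha, \Lambda, \eta)$ over all data points. The best lower bound
can be computed by maximizing each $L(\gamma_i, \phi_i, \xi_i; \alpha, \Lambda, \eta)$ over the free parameters $(\gamma_i, \phi_i, \xi_i)$. A
direct calculation gives the following update equations that iteratively maximize the lower bound.
In particular, for DLDA, we have

$$\phi_{ijc} \propto \exp\left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) + \sum_{v=1}^{V} x_{ij}^{v} \log\beta_{cv} + \frac{1}{m_i} \sum_{h=1}^{t-1} \left( \eta_{hc}\, y_{ih} - \exp(\eta_{hc})/\xi_i \right) \right) , \tag{16.28}$$

$$\gamma_{ic} = \alpha_c + \sum_{j=1}^{m_i} \phi_{ijc} , \tag{16.29}$$

$$\xi_i = 1 + \frac{1}{m_i} \sum_{h=1}^{t-1} \sum_{c=1}^{k} \sum_{j=1}^{m_i} \phi_{ijc} \exp(\eta_{hc}) , \quad [i]_1^n , [j]_1^{m_i} , [c]_1^k . \tag{16.30}$$

For DMNB, we have

$$\phi_{ijc} \propto \exp\left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) + \log p_{\psi_j}(x_{ij}|\theta_{jc}) + \frac{1}{m_i} \sum_{h=1}^{t-1} \left( \eta_{hc}\, y_{ih} - \exp(\eta_{hc})/\xi_i \right) \right) , \tag{16.31}$$

$$\gamma_{ic} = \alpha_c + \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \phi_{ijc} , \tag{16.32}$$

$$\xi_i = 1 + \frac{1}{m_i} \sum_{h=1}^{t-1} \sum_{c=1}^{k} \sum_{\substack{j=1 \\ \exists x_{ij}}}^{d} \phi_{ijc} \exp(\eta_{hc}) , \quad [i]_1^n , [j]_1^d , [c]_1^k . \tag{16.33}$$

For a specific model, such as DMNB-Gaussian, the updating equation for $\phi_{ijc}$ could be obtained
by replacing the corresponding distributions in place of $p_{\psi_j}(x_{ij}|\theta_{jc})$ in (16.31). The form of the
updates for $\gamma_{ic}$ and $\xi_i$ is independent of the exponential family being used.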
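
To make the coordinate updates concrete, the following sketch iterates (16.28)-(16.30) for a single document. It is an illustration under stated assumptions rather than the authors' implementation. The names `digamma` and `dlda_e_step`, the uniform initialization, and the fixed iteration count are all choices made here. The digamma function $\Psi$ is approximated with a recurrence plus an asymptotic series so that only the standard library is needed, and `beta` is assumed to have strictly positive entries.

```python
import math

def digamma(x):
    """Psi(x) for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def dlda_e_step(words, label, alpha, beta, eta, iterations=50):
    """Coordinate ascent on (phi, gamma, xi) for one document.

    words: word indices; label: class index in 0..t-1 (the last class has eta = 0);
    alpha: k Dirichlet parameters; beta: k x V topic-word probabilities (all > 0);
    eta: (t-1) x k logistic regression weights.
    """
    k, m, t1 = len(alpha), len(words), len(eta)
    y = [1.0 if label == h else 0.0 for h in range(t1)]   # indicators y_ih

    def update_xi(phi):
        # (16.30): xi = 1 + (1/m) * sum_h sum_c sum_j phi_jc * exp(eta_hc)
        total = sum(phi[j][c] * math.exp(eta[h][c])
                    for h in range(t1) for c in range(k) for j in range(m))
        return 1.0 + total / m

    phi = [[1.0 / k] * k for _ in range(m)]               # assumed uniform start
    gamma = [alpha[c] + m / k for c in range(k)]
    xi = update_xi(phi)
    for _ in range(iterations):
        psi_sum = digamma(sum(gamma))
        label_term = [sum(eta[h][c] * y[h] - math.exp(eta[h][c]) / xi for h in range(t1)) / m
                      for c in range(k)]
        for j, v in enumerate(words):
            # (16.28): log phi_jc up to an additive constant, then normalize over c
            logits = [digamma(gamma[c]) - psi_sum + math.log(beta[c][v]) + label_term[c]
                      for c in range(k)]
            top = max(logits)
            weights = [math.exp(value - top) for value in logits]
            total = sum(weights)
            phi[j] = [w / total for w in weights]
        # (16.29): gamma_c = alpha_c + sum_j phi_jc
        gamma = [alpha[c] + sum(phi[j][c] for j in range(m)) for c in range(k)]
        xi = update_xi(phi)
    return phi, gamma, xi

beta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]                 # k = 2 topics, V = 3 words
phi, gamma, xi = dlda_e_step([0, 0, 1, 2], 0, [0.5, 0.5], beta, [[1.0, -1.0]])
print(gamma, xi)
```

Note that `phi` holds one $k$-vector per word, so its size grows with the document length.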


16.5.2 Parameter Estimation 


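The goal of parameter estimation is to obtain $(\alpha, \Lambda, \eta)$ such that $\log p(\mathcal{X}, \mathcal{Y}|\alpha, \Lambda, \eta)$ is maximized.
Since the log-likelihood is intractable, one can use the lower bound as a surrogate objective to be
maximized. Note that for a fixed value of the variational parameters $(\gamma_i^*, \phi_i^*, \xi_i^*)$ for each $(x_i, y_i)$,
the lower bound of $\log p(\mathcal{X}, \mathcal{Y}|\alpha, \Lambda, \eta)$, i.e., $\sum_{i=1}^{n} L(\gamma_i^*, \phi_i^*, \xi_i^*; \alpha, \Lambda, \eta)$, is a function of the
parameters $(\alpha, \Lambda, \eta)$. Maximizing $\sum_{i=1}^{n} L(\gamma_i^*, \phi_i^*, \xi_i^*; \alpha, \Lambda, \eta)$ with respect to $(\alpha, \Lambda, \eta)$ yields the
parameter estimate.

The update of $\alpha$ is independent of the specific model. Using the Newton-Raphson algorithm
(Blei et al., 2003; Minka, 2003b) with line search, the updating equation is:

$$\alpha_c' = \alpha_c - \nu\, \frac{g_c - u}{h_c} , \tag{16.34}$$

where

$$\begin{aligned}
g_c &= n \left( \Psi\left(\sum_{l=1}^{k} \alpha_l\right) - \Psi(\alpha_c) \right) + \sum_{i=1}^{n} \left( \Psi(\gamma_{ic}) - \Psi\left(\sum_{l=1}^{k} \gamma_{il}\right) \right) , \\
h_c &= -n \Psi'(\alpha_c) , \\
u &= \frac{\sum_{l=1}^{k} g_l / h_l}{w^{-1} + \sum_{l=1}^{k} h_l^{-1}} , \\
w &= n \Psi'\left(\sum_{l=1}^{k} \alpha_l\right) .
\end{aligned}$$

Here $g_c$ is the derivative of the aggregate lower bound with respect to $\alpha_c$, obtained by
differentiating (16.16) (or, equivalently, (16.22)) summed over all data points.

Since $\alpha$ has the constraint $\alpha_c > 0$, by multiplying the second term of (16.34) by $\nu$, we are
performing a line search to prevent $\alpha_c$ from leaving the feasible range. At the beginning of each
iteration, $\nu$ is set to 1. If the updated $\alpha_c$ falls into the feasible range, the algorithm goes on to the
next iteration; otherwise, it reduces $\nu$ by a factor of 0.5 until the updated $\alpha_c$ becomes valid.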
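
One Newton-Raphson step with the step-halving rule can be sketched as follows. The code is a minimal illustration in plain Python. The helper names (`digamma`, `trigamma`, `newton_step_alpha`) are assumptions made here, the special functions are series approximations rather than library calls, and the gradient uses the expression for $g_c$ obtained by differentiating (16.16).

```python
import math

def digamma(x):
    """Psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """Psi'(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def newton_step_alpha(alpha, gammas):
    """One update of alpha in the form of (16.34); gammas holds one gamma_i vector per data point."""
    n, k = len(gammas), len(alpha)
    alpha_sum = sum(alpha)
    g = [n * (digamma(alpha_sum) - digamma(alpha[c]))
         + sum(digamma(row[c]) - digamma(sum(row)) for row in gammas)
         for c in range(k)]
    h = [-n * trigamma(a) for a in alpha]
    w = n * trigamma(alpha_sum)
    u = sum(g[c] / h[c] for c in range(k)) / (1.0 / w + sum(1.0 / hc for hc in h))
    nu = 1.0
    while True:
        new_alpha = [alpha[c] - nu * (g[c] - u) / h[c] for c in range(k)]
        if all(a > 0 for a in new_alpha):
            return new_alpha
        nu *= 0.5  # halve the step until every alpha_c is positive

# Four data points whose gamma vectors are more concentrated than alpha.
print(newton_step_alpha([1.0, 1.0], [[5.0, 5.0]] * 4))
```

The loop always terminates because the candidate tends to the current, strictly positive $\alpha$ as $\nu$ shrinks.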

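For other model parameters, the update for $\eta$ is given by

$$\eta_{hc} = \log\left( \frac{\sum_{i=1}^{n} \sum_{j=1}^{m_i} y_{ih}\, \phi_{ijc} / m_i}{\sum_{i=1}^{n} \sum_{j=1}^{m_i} \phi_{ijc} / (m_i \xi_i)} \right) , \quad [h]_1^{t-1} , [c]_1^k .$$

The update equation for $\Lambda$ is model dependent. For DLDA, the update equation for $\Lambda$, i.e., $\beta$, is
given as follows:

$$\beta_{cv} \propto \sum_{i=1}^{n} \sum_{j=1}^{m_i} \phi_{ijc}\, x_{ij}^{v} , \quad [c]_1^k , [v]_1^V . \tag{16.35}$$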


For DMNB, following Redner and Walker (1984) and Banerjee et al. (2005b), the parameters $\Lambda$,
i.e., $\Theta$, can be estimated in a closed form for all exponential family distributions. From the Bregman
divergence perspective, let $\tau_{jc}$ be the expectation parameter for the $j$th feature of the $c$th component;
the estimation for $\tau_{jc}$ is given by

$$\tau_{jc} = \frac{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc}\, s_{ij}}{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc}} , \quad [j]_1^d , [c]_1^k , \tag{16.36}$$

where $s_{ij}$ is the sufficient statistic. The natural parameter $\theta_{jc}$ is given by conjugacy as

$$\theta_{jc} = \nabla\phi_j(\tau_{jc}) , \quad [j]_1^d , [c]_1^k ,$$

where $\phi_j(\cdot)$ is the conjugate of the cumulant function $\psi_j$ for each feature. We now give the parameter
estimation for two special cases: DMNB-Gaussian and DMNB-Discrete.


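DMNB-Gaussian: For Gaussians, by maximizing the lower bound, the exact update equations for
$\mu_{jc}$ and $\sigma_{jc}^2$ can be obtained as

$$\mu_{jc} = \frac{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc}\, x_{ij}}{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc}} , \tag{16.37}$$

$$\sigma_{jc}^2 = \frac{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc} (x_{ij} - \mu_{jc})^2}{\sum_{i=1, \exists x_{ij}}^{n} \phi_{ijc}} , \quad [j]_1^d , [c]_1^k . \tag{16.38}$$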
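
The Gaussian updates are weighted means and variances taken over the data points for which feature $j$ is observed. A minimal sketch follows. The function name `gaussian_m_step`, the use of `None` to mark a missing feature, and the nested-list layout `phi[i][j][c]` are assumptions of this example.

```python
def gaussian_m_step(X, phi, j, c):
    """Weighted mean and variance for feature j and component c, as in (16.37)-(16.38).

    X[i][j] is a float, or None when feature j is missing for data point i;
    phi[i][j][c] is the variational weight of component c for that feature.
    """
    observed = [i for i in range(len(X)) if X[i][j] is not None]
    weight = sum(phi[i][j][c] for i in observed)
    mu = sum(phi[i][j][c] * X[i][j] for i in observed) / weight
    var = sum(phi[i][j][c] * (X[i][j] - mu) ** 2 for i in observed) / weight
    return mu, var

# One feature (d = 1), two components (k = 2), four data points; the third value is missing.
X = [[1.0], [3.0], [None], [5.0]]
phi = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.3, 0.7]], [[0.0, 1.0]]]
print(gaussian_m_step(X, phi, j=0, c=0))   # prints (2.0, 1.0)
```

In this toy input the second component receives weight from a single observed point, so its variance estimate is exactly zero. A practical implementation would likely need a variance floor, which the chapter does not discuss.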


DMNB-Discrete: For a discrete distribution $p_{jc}$ over $r = 1, \ldots, r_j$ values for feature $j$, the estimate
of $p_{jc}(r)$ is given by

$$p_{jc}(r) \propto \sum_{i=1}^{n} \phi_{ijc}\, \mathbb{1}(x_{ij} = r) , \quad [c]_1^k , [j]_1^d , [r]_1^{r_j} , \tag{16.39}$$

where $\mathbb{1}(x_{ij} = r)$ is the indicator of observing value $r$ for feature $j$ in observation $x_i$. While such
a maximum likelihood (ML) estimate will give the maximizing parameters on an observed training
set, there is the possibility of some probability estimates being zero. Such an eventuality does not
pose a problem on the training set, but inference on unseen or test data may become problematic.
If a feature in the test set takes a value that it has not taken in the entire training set, the model will
assign a zero probability to the entire set of test observations. The standard approach to address the
problem is to use smoothing, so that none of the estimated parameters is zero. In particular, we use
Laplace smoothing, which results from a maximum a posteriori (MAP) estimate (DeGroot, 1970)
assuming a Dirichlet prior over each discrete distribution, so that

$$p_{jc}(r) \propto \sum_{i=1}^{n} \phi_{ijc}\, \mathbb{1}(x_{ij} = r) + \epsilon , \quad [c]_1^k , [j]_1^d , [r]_1^{r_j} , \tag{16.40}$$

for some $\epsilon > 0$.
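
The effect of the smoothing constant can be seen in a few lines of code. This is a sketch for a single feature $j$ and component $c$. The function name `discrete_estimate`, the zero-based coding of feature values, and the use of `None` for a missing value are assumptions of the example.

```python
def discrete_estimate(values, weights, num_values, epsilon=0.0):
    """Normalized estimate of p_jc(r) from phi-weighted counts plus a constant epsilon.

    values[i] is the observed value of the feature for data point i (0-based), or None
    if missing; weights[i] is phi_ijc. epsilon = 0 gives the ML estimate (16.39),
    and epsilon > 0 gives the smoothed estimate (16.40).
    """
    counts = [epsilon] * num_values
    for value, weight in zip(values, weights):
        if value is not None:          # skip missing features
            counts[value] += weight
    total = sum(counts)
    return [count / total for count in counts]

values = [0, 0, 1, None]               # the value 2 never occurs in the training data
weights = [1.0, 0.5, 0.5, 0.9]
print(discrete_estimate(values, weights, num_values=3))               # ML: third entry is 0
print(discrete_estimate(values, weights, num_values=3, epsilon=0.5))  # smoothed: all positive
```

With `epsilon=0.5` the unseen value receives probability 0.5/3.5 = 1/7 instead of zero, so a test point taking that value no longer forces the likelihood to zero.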


16.5.3 Variational EM for DMNB 

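Based on the variational inference and parameter estimation updates, it is straightforward
to construct a variational EM algorithm to estimate $(\alpha, \Lambda, \eta)$. Starting with an initial guess
$(\alpha^{(0)}, \Lambda^{(0)}, \eta^{(0)})$, the variational EM algorithm alternates between two steps:

1. E-Step: Given $(\alpha^{(t-1)}, \Lambda^{(t-1)}, \eta^{(t-1)})$, for each data point $x_i$, find the optimal variational
parameters

$$(\gamma_i^{(t)}, \phi_i^{(t)}, \xi_i^{(t)}) = \arg\max_{(\gamma_i, \phi_i, \xi_i)} L(\gamma_i, \phi_i, \xi_i; \alpha^{(t-1)}, \Lambda^{(t-1)}, \eta^{(t-1)}) .$$

$L(\gamma_i^{(t)}, \phi_i^{(t)}, \xi_i^{(t)}; \alpha, \Lambda, \eta)$ gives a lower bound to $\log p(x_i, y_i|\alpha, \Lambda, \eta)$.

2. M-Step: An improved estimate of the model parameters $(\alpha, \Lambda, \eta)$ is obtained by maximizing the
aggregate lower bound:

$$(\alpha^{(t)}, \Lambda^{(t)}, \eta^{(t)}) = \arg\max_{(\alpha, \Lambda, \eta)} \sum_{i=1}^{n} L(\gamma_i^{(t)}, \phi_i^{(t)}, \xi_i^{(t)}; \alpha, \Lambda, \eta) .$$

After $t$ iterations, the objective function becomes $\sum_{i=1}^{n} L(\gamma_i^{(t)}, \phi_i^{(t)}, \xi_i^{(t)}; \alpha^{(t)}, \Lambda^{(t)}, \eta^{(t)})$. In the
$(t+1)$th iteration, we have

$$\begin{aligned}
\sum_{i=1}^{n} L(\gamma_i^{(t)}, \phi_i^{(t)}, \xi_i^{(t)}; \alpha^{(t)}, \Lambda^{(t)}, \eta^{(t)})
&\le \sum_{i=1}^{n} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}, \xi_i^{(t+1)}; \alpha^{(t)}, \Lambda^{(t)}, \eta^{(t)}) \\
&\le \sum_{i=1}^{n} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}, \xi_i^{(t+1)}; \alpha^{(t+1)}, \Lambda^{(t+1)}, \eta^{(t+1)}) .
\end{aligned}$$

The first inequality holds because in the E-step, $(\gamma_i^{(t+1)}, \phi_i^{(t+1)}, \xi_i^{(t+1)})$ maximizes
$L(\gamma_i, \phi_i, \xi_i; \alpha^{(t)}, \Lambda^{(t)}, \eta^{(t)})$. The second inequality holds because in the M-step,
$(\alpha^{(t+1)}, \Lambda^{(t+1)}, \eta^{(t+1)})$ maximizes $\sum_{i=1}^{n} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}, \xi_i^{(t+1)}; \alpha, \Lambda, \eta)$. Therefore, the objective
function is non-decreasing until convergence.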


16.6 Fast Variational Inference 

The variational distribution in Section 16.5 exactly follows the idea proposed for latent Dirichlet
allocation (LDA) (Blei et al., 2003), where every feature $j$ of the data point $x_i$ has a variational
parameter $\phi_{ij}$ for the corresponding discrete distribution. This section introduces a different
variational distribution with a smaller number of parameters, yielding a much faster variational inference
algorithm. The fast variational inference is used for both DLDA and DMNB.


16.6.1 Variational Approximation 

Given the lower bound to the log-likelihood of each data point as (16.11), the variational distributions
in (16.9) and (16.10) assign a separate discrete distribution to each $x_{ij}$ of the data point $x_i$. The
total number of $\phi_{ij}$ needed is hence $\sum_{i=1}^{n} m_i$, which is a huge number for high-dimensional data.
Meanwhile, since in the E-step of the EM algorithm the optimization is performed over each
variational parameter, a large number of variational parameters will lead to a large number of
optimizations to perform, significantly slowing the algorithm down. To make the algorithm more efficient, a
new family of variational distributions $q_2$ is introduced (Figure 16.2(b)). In particular, for DLDA,

$$q_2(\pi, z|\gamma, \phi) = q_2(\pi|\gamma) \prod_{j=1}^{m} q_2(z_j|\phi) , \tag{16.41}$$

and for DMNB,

$$q_2(\pi, z|\gamma, \phi) = q_2(\pi|\gamma) \prod_{\substack{j=1 \\ \exists x_j}}^{d} q_2(z_j|\phi) , \tag{16.42}$$

where $\gamma$ and $\phi$ are $k$-dimensional variational parameters for Dirichlet and discrete distributions,
respectively. Compared to $q_1(\pi, z|\gamma, \phi)$ in (16.9) and (16.10), $q_2(\pi, z|\gamma, \phi)$ only has one $\phi$ for
each data point. The total number of $\phi$s needed hence decreases from $\sum_{i=1}^{n} m_i$ in (16.9) and
(16.10), to $n$ in (16.41) and (16.42). Accordingly, the number of optimizations over $\phi$ also
decreases from $\sum_{i=1}^{n} m_i$ to $n$. Such a reduction implies a big saving in space and time, especially for
high-dimensional data.

Given the variational distribution $q_2$, one could have a lower-bound function to the
log-likelihood of each data point similar to (16.12):

$$\begin{aligned}
L(\gamma_i, \phi_i; \alpha, \Lambda, \eta) = {} & E_{q_2}[\log p(\pi_i|\alpha)] + E_{q_2}[\log p(z_i|\pi_i)] + E_{q_2}[\log p(x_i|z_i, \Lambda)] \\
& - E_{q_2}[\log q_2(\pi_i|\gamma_i)] - E_{q_2}[\log q_2(z_i|\phi_i)] + E_{q_2}[\log p(y_i|z_i, \eta)] .
\end{aligned} \tag{16.43}$$

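For DLDA, the six terms in (16.43) are given as follows:

\begin{align}
E_{q_2}[\log p(\pi_i \mid \alpha)]
  &= \log\Gamma\Bigl(\sum_{c=1}^{k}\alpha_c\Bigr) - \sum_{c=1}^{k}\log\Gamma(\alpha_c)
     + \sum_{c=1}^{k}(\alpha_c - 1)\Bigl(\Psi(\gamma_{ic}) - \Psi\Bigl(\sum_{l=1}^{k}\gamma_{il}\Bigr)\Bigr), \tag{16.44} \\
E_{q_2}[\log p(z_i \mid \pi_i)]
  &= m_i \sum_{c=1}^{k}\phi_{ic}\Bigl(\Psi(\gamma_{ic}) - \Psi\Bigl(\sum_{l=1}^{k}\gamma_{il}\Bigr)\Bigr), \tag{16.45} \\
E_{q_2}[\log p(x_i \mid z_i, \beta)]
  &= \sum_{j=1}^{m_i}\sum_{c=1}^{k}\sum_{v=1}^{V}\phi_{ic}\,\mathbb{1}(x_{ij}=v)\log\beta_{cv}, \tag{16.46} \\
E_{q_2}[\log q_2(\pi_i \mid \gamma_i)]
  &= \log\Gamma\Bigl(\sum_{c=1}^{k}\gamma_{ic}\Bigr) - \sum_{c=1}^{k}\log\Gamma(\gamma_{ic})
     + \sum_{c=1}^{k}(\gamma_{ic} - 1)\Bigl(\Psi(\gamma_{ic}) - \Psi\Bigl(\sum_{l=1}^{k}\gamma_{il}\Bigr)\Bigr), \tag{16.47} \\
E_{q_2}[\log q_2(z_i \mid \phi_i)]
  &= m_i \sum_{c=1}^{k}\phi_{ic}\log\phi_{ic}, \tag{16.48} \\
E_{q_2}[\log p(y_i \mid z_i, \eta)]
  &\ge \frac{1}{m_i}\sum_{c=1}^{k}\sum_{j=1}^{m_i}\sum_{h=1}^{t-1}
       \phi_{ic}\Bigl(\eta_{hc}\,y_{ih} - \frac{1}{\xi_i}\exp(\eta_{hc})\Bigr)
       + 1 - \frac{1}{\xi_i} - \log(\xi_i). \tag{16.49}
\end{align}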
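The last term is the only one stated as an inequality. The following is a sketch of where the bound in (16.49) can come from, under assumptions not restated in this section: that $y_i$ follows a multi-class logistic regression on the averaged indicator $\bar z_i = \frac{1}{m_i}\sum_{j} z_{ij}$ with $\eta_t = 0$, and that $\xi_i > 0$ is an auxiliary variational parameter. The symbols $\bar z_i$ and $u$ are introduced here only for illustration.

```latex
\begin{align*}
\log p(y_i \mid z_i, \eta)
  &= \sum_{h=1}^{t-1} y_{ih}\,\eta_h^{\top}\bar z_i
     - \log\Bigl(1 + \sum_{h=1}^{t-1}\exp(\eta_h^{\top}\bar z_i)\Bigr)
  && \text{(assumed label model)} \\
\log u &\le \frac{u}{\xi_i} + \log\xi_i - 1
  && \text{(tangent bound on the logarithm)} \\
\exp(\eta_h^{\top}\bar z_i) &\le \sum_{c=1}^{k}\bar z_{ic}\exp(\eta_{hc})
  && \text{(convexity; $\bar z_i$ lies on the simplex)} \\
E_{q_2}[\bar z_{ic}] &= \phi_{ic}
  && \text{(every $z_{ij}$ shares $\phi_i$ under $q_2$)} \\
E_{q_2}[\log p(y_i \mid z_i, \eta)]
  &\ge \sum_{c=1}^{k}\sum_{h=1}^{t-1}\phi_{ic}\Bigl(\eta_{hc}\,y_{ih} - \frac{\exp(\eta_{hc})}{\xi_i}\Bigr)
     + 1 - \frac{1}{\xi_i} - \log\xi_i .
\end{align*}
```

The final line agrees with (16.49), since the factor $\frac{1}{m_i}\sum_{j=1}^{m_i}$ there averages a summand that does not depend on $j$. Setting the derivative with respect to $\xi_i$ to zero gives $\xi_i = 1 + \sum_{h}\sum_{c}\phi_{ic}\exp(\eta_{hc})$, the value at which this bound is tightest.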


For DMNB, the six terms in (16.43) are the same as in (16.44)-(16.49), except for (16.46), which
is given by

\[ E_{q_2}[\log p(x_i \mid z_i, \theta)]
   = \sum_{j=1,\, \exists x_{ij}}^{d}\ \sum_{c=1}^{k} \phi_{ic} \log p_{\psi_j}(x_{ij} \mid \theta_{jc}). \]

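Maximizing $\mathcal{L}(\gamma_i, \phi_i; \alpha, \Lambda, \eta)$ with respect to the variational
parameters yields the best lower bound. In particular, in DLDA,

\begin{align}
\phi_{ic} &\propto \exp\Bigl(\Psi(\gamma_{ic}) - \Psi\Bigl(\sum_{l=1}^{k}\gamma_{il}\Bigr)
   + \frac{1}{m_i}\sum_{j=1}^{m_i}\sum_{v=1}^{V}\mathbb{1}(x_{ij}=v)\log\beta_{cv}
   + \frac{1}{m_i}\sum_{h=1}^{t-1}\bigl(\eta_{hc}\,y_{ih} - \exp(\eta_{hc})/\xi_i\bigr)\Bigr), \tag{16.50} \\
\gamma_{ic} &= \alpha_c + m_i\,\phi_{ic}, \tag{16.51} \\
\xi_i &= 1 + \sum_{h=1}^{t-1}\sum_{c=1}^{k}\phi_{ic}\exp(\eta_{hc}), \qquad [i]_1^n,\ [c]_1^k. \tag{16.52}
\end{align}

In DMNB,

\begin{align}
\phi_{ic} &\propto \exp\Bigl(\Psi(\gamma_{ic}) - \Psi\Bigl(\sum_{l=1}^{k}\gamma_{il}\Bigr)
   + \frac{1}{m_i}\sum_{j=1,\, \exists x_{ij}}^{d}\log p_{\psi_j}(x_{ij} \mid \theta_{jc})
   + \frac{1}{m_i}\sum_{h=1}^{t-1}\bigl(\eta_{hc}\,y_{ih} - \exp(\eta_{hc})/\xi_i\bigr)\Bigr), \tag{16.53} \\
\gamma_{ic} &= \alpha_c + m_i\,\phi_{ic}, \tag{16.54} \\
\xi_i &= 1 + \sum_{h=1}^{t-1}\sum_{c=1}^{k}\phi_{ic}\exp(\eta_{hc}), \qquad [i]_1^n,\ [c]_1^k. \tag{16.55}
\end{align}

Again, for a specific model of DMNB, such as DMNB-Gaussian, the updating equation for $\phi_{ic}$
can be obtained by substituting the corresponding distribution for $p_{\psi_j}(x_{ij} \mid \theta_{jc})$
in (16.53). The form of the updates for $\gamma_{ic}$ and $\xi_i$ is independent of the exponential
family being used.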
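The coordinate updates (16.50)-(16.52) for a single document can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform initialization of $\phi_i$, the fixed iteration count, the one-hot label vector `y` of length $t$, and the hand-rolled `digamma` helper are all assumptions made for the example.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series

def fast_dlda_e_step(word_ids, y, alpha, beta, eta, n_iter=50):
    """Hypothetical helper. word_ids: the m_i word indices of one document;
    y: one-hot label of length t; alpha: length k; beta: k x V rows with
    strictly positive entries; eta: (t-1) x k. Returns (phi, gamma, xi)."""
    k, m = len(alpha), len(word_ids)
    # Averaged word log-likelihood per component; it does not change across iterations.
    word_term = [sum(math.log(beta[c][v]) for v in word_ids) / m for c in range(k)]
    phi = [1.0 / k] * k  # uniform start (assumption)
    gamma = [alpha[c] + m * phi[c] for c in range(k)]
    xi = 1.0 + sum(phi[c] * math.exp(row[c]) for row in eta for c in range(k))
    for _ in range(n_iter):
        psi_total = digamma(sum(gamma))
        logits = []
        for c in range(k):
            label_term = sum(row[c] * y[h] - math.exp(row[c]) / xi
                             for h, row in enumerate(eta)) / m
            logits.append(digamma(gamma[c]) - psi_total + word_term[c] + label_term)
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        phi = [w / total for w in weights]                  # (16.50)
        gamma = [alpha[c] + m * phi[c] for c in range(k)]   # (16.51)
        xi = 1.0 + sum(phi[c] * math.exp(row[c])
                       for row in eta for c in range(k))    # (16.52)
    return phi, gamma, xi
```

Only one $k$-vector $\phi_i$ is stored per document, whatever its length, which is the source of the saving discussed in Section 16.6.1.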


16.6.2 Parameter Estimation 


After obtaining the variational parameters, one can obtain a tractable lower bound of the
log-likelihood as a function of the model parameters $(\alpha, \Lambda, \eta)$. The estimation for
$\alpha$ is the same as in Section 16.5.2, using the Newton-Raphson algorithm with line search. The
estimation for $\eta$ is given by

\[ \eta_{hc} = \log\frac{\sum_{i=1}^{n} y_{ih}\,\phi_{ic}}{\sum_{i=1}^{n} \phi_{ic}/\xi_i},
   \qquad [c]_1^k,\ [h]_1^{t-1}, \]

for both DLDA and DMNB.
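The closed-form update for $\eta$ can be sketched as below. The function name, the nested-list data layout, and the small floor `tiny` that guards against taking the logarithm of zero are assumptions for illustration and are not part of the chapter.

```python
import math

def update_eta(phi, y, xi, tiny=1e-12):
    """Hypothetical helper. phi: n x k memberships; y: n x t one-hot labels;
    xi: length-n auxiliary parameters. Returns eta as a (t-1) x k list of lists."""
    n, k, t = len(phi), len(phi[0]), len(y[0])
    eta = []
    for h in range(t - 1):
        row = []
        for c in range(k):
            num = sum(y[i][h] * phi[i][c] for i in range(n))
            den = sum(phi[i][c] / xi[i] for i in range(n))
            # The floor on the numerator is an assumption, used when class h
            # carries no weight on component c.
            row.append(math.log(max(num, tiny) / den))
        eta.append(row)
    return eta
```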

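For the estimation of $\Lambda$ in DLDA, the update equation of $\beta$ is given by

\[ \beta_{cv} \propto \sum_{i=1}^{n}\sum_{j=1}^{m_i}\phi_{ic}\,\mathbb{1}(x_{ij}=v),
   \qquad [c]_1^k,\ [v]_1^V. \tag{16.56} \]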


For the estimation of $\theta$ in DMNB, from a Bregman divergence perspective, assuming the expectation
parameter for the $j$th feature of component $c$ is $\tau_{jc}$, the estimation for $\tau_{jc}$ is given by

\[ \tau_{jc} = \frac{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}\,s_{ij}}{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}},
   \qquad [j]_1^d,\ [c]_1^k, \tag{16.57} \]

where $s_{ij}$ is the sufficient statistic and the natural parameter $\theta_{jc} = \nabla f_j(\tau_{jc})$
by conjugacy, given $f_j(\cdot)$ the conjugate of the cumulant function $\psi_j$ for each feature. For two
special cases, DMNB-Gaussian and DMNB-Discrete, the closed-form parameter estimates are given below. Note
that (16.57)-(16.60) are mild variants of (16.36)-(16.39), as $\phi_{ic}$ does not depend on feature $j$.


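Fast DMNB-Gaussian: For Gaussians, the update equations for $\mu_{jc}$ and $\sigma_{jc}^2$ are given by

\begin{align}
\mu_{jc} &= \frac{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}\,x_{ij}}{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}}, \tag{16.58} \\
\sigma_{jc}^2 &= \frac{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}\,(x_{ij} - \mu_{jc})^2}{\sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}},
   \qquad [j]_1^d,\ [c]_1^k. \tag{16.59}
\end{align}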
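The two Gaussian updates are membership-weighted means and variances taken over the observed entries only. A minimal sketch follows, assuming missing values are encoded as `None` and that every component has positive total weight on each feature; the function name and data layout are illustrative.

```python
def fast_dmnb_gaussian_m_step(X, phi):
    """Hypothetical helper. X: n x d values with None marking a missing entry;
    phi: n x k memberships. Returns (mu, var), each a d x k list of lists."""
    n, d, k = len(X), len(X[0]), len(phi[0])
    mu = [[0.0] * k for _ in range(d)]
    var = [[0.0] * k for _ in range(d)]
    for j in range(d):
        observed = [i for i in range(n) if X[i][j] is not None]  # the "exists x_ij" condition
        for c in range(k):
            weight = sum(phi[i][c] for i in observed)
            mu[j][c] = sum(phi[i][c] * X[i][j] for i in observed) / weight           # (16.58)
            var[j][c] = sum(phi[i][c] * (X[i][j] - mu[j][c]) ** 2
                            for i in observed) / weight                              # (16.59)
    return mu, var
```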


Fast DMNB-Discrete: For a discrete distribution $p_{jc}$ over $r = 1, \ldots, r_j$ values for feature $j$,
the update equation for $p_{jc}(r)$ is given by

\[ p_{jc}(r) \propto \sum_{i=1,\, \exists x_{ij}}^{n}\phi_{ic}\,\mathbb{1}(x_{ij} = r) + \epsilon,
   \qquad [c]_1^k,\ [j]_1^d,\ [r]_1^{r_j}, \tag{16.60} \]

where $\mathbb{1}(x_{ij} = r)$ is the indicator of observing value $r$ for feature $j$ in observation $x_i$.
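A sketch of the discrete update follows: weighted counts plus a smoothing constant, then normalization. The 0-based value encoding, the `None` marker for missing entries, the default `eps`, and the function name are assumptions made for the example.

```python
def fast_dmnb_discrete_m_step(X, phi, n_values, eps=1e-3):
    """Hypothetical helper. X: n x d integer codes in range(n_values[j]) or None;
    phi: n x k memberships. Returns p with p[j][c][r] summing to 1 over r."""
    n, d, k = len(X), len(X[0]), len(phi[0])
    p = []
    for j in range(d):
        per_component = []
        for c in range(k):
            counts = [eps] * n_values[j]          # smoothing constant from (16.60)
            for i in range(n):
                if X[i][j] is not None:           # skip missing features
                    counts[X[i][j]] += phi[i][c]
            total = sum(counts)
            per_component.append([v / total for v in counts])
        p.append(per_component)
    return p
```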

Given the updates for variational and model parameters, a variational EM algorithm could be 
constructed as in Section 16.5.3. 


16.7 Experimental Results 

In this section, experimental results for discriminative mixed membership models are presented. 
For simplicity, DMM models with the regular variational inference algorithm following Blei et al. 
(2003) are referred to as “Standard DMM” (Std DMM), which includes Standard DLDA (Std 
DLDA) and Standard DMNB (Std DMNB). DMM models with the fast variational inference al- 
gorithm are referred to as “Fast DMM,” which includes Fast DLDA and Fast DMNB. An overview 
of all DMM models is given in Figure 16.3. In this section, first, DMM models are compared to their
unsupervised counterparts, mixed membership (MM) models. Second, Fast DMM models are compared
to Std DMM models. Further, Fast DMM models are compared to several state-of-the-art classification
algorithms. Finally, several word lists of topics generated from Fast DLDA are presented. The
experiments are performed using 10-fold cross-validation. In particular, the dataset is divided evenly
into ten parts, one of which is picked as the test set, and the remaining nine parts are used as the
training set. The process is repeated ten times, with each part used once as the test set. The mean
and standard deviation of the results on the test sets over the 10 folds are presented.
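The evaluation protocol can be sketched as follows. Shuffling before splitting and the use of the population standard deviation are assumptions; the chapter states neither, and the function names are illustrative.

```python
import random

def ten_fold_splits(n, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every index appears in exactly one test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption, not stated in the text
    folds = [idx[f::n_folds] for f in range(n_folds)]  # ten near-even parts
    for f in range(n_folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]

def mean_std(scores):
    """Mean and (population) standard deviation of per-fold scores."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var ** 0.5
```

In use, one would fit a model on each `train` index set, score it on the matching `test` set, and pass the ten scores to `mean_std`.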



FIGURE 16.3 

An overview of DMM models. 


16.7.1 Datasets 

Seven datasets from the UCI machine learning repository¹ are used for the experiments of DMNB.
These datasets are represented as real-valued full matrices without missing entries. The numbers of
data points, features, and classes in each dataset are listed in Table 16.1.

TABLE 16.1

The number of data points, features, and classes in each UCI dataset.

Dataset       Ecoli   Glass   Iono   Seg    Sona   Wdbc   Wine
Data points   336     214     351    2310   208    569    178
Features      7       9       32     19     60     30     13
Classes       8       6       2      7      2      2      3


Five text datasets are used for the experiments of DLDA. The details of the datasets are as
follows:

1. Nasa: Nasa is a text dataset from the Aviation Safety Reporting System (ASRS) online
database.² This database contains aviation safety reports submitted by pilots, controllers, and
others. The dataset used is a subset of the whole database. It contains 4,226 documents about
the anomalies originating from three sources: flight crew, maintenance, and passengers. The
vocabulary size is 604.

2. Classic3: Classic3 (Dhillon et al., 2003) is a well-known text dataset. It contains 3,893
documents from three different classes: aeronautics, medicine, and information retrieval.
The vocabulary size is 5,923.

3. CMU Newsgroup: The CMU Newsgroup is also a benchmark text dataset (Lang, 1995).
The standard dataset of CMU Newsgroup contains 19,997 messages, collected from 20 different
USENET newsgroups. Three subsets are used for the experiments: (1) Diff is a collection of
3,000 messages from 3 different newsgroups with 1,000 messages for each class:
alt.atheism, rec.sport.baseball, and sci.space. The vocabulary size is 7,666. (2) Sim is a
collection of 3,000 messages from 3 somewhat similar newsgroups with 1,000 messages for each class:
talk.politics.guns, talk.politics.mideast, and talk.politics.misc. The vocabulary size is 10,083.
(3) Same is a collection of 3,000 messages from 3 very similar newsgroups with 1,000 messages for
each class: comp.graphics, comp.os.ms-windows, and comp.windows.x. The vocabulary size is
5,932.


16.7.2 DMM vs. MM 

We first compare DMM models to corresponding MM models. In particular, we compare DLDA 
with LDA, and compare DMNB with MNB. Both the regular and fast variational inference are used 
for each model. In principle, MM models are not used for classification, but given the initialization 
we will introduce below, there is a one-to-one mapping between the component and the class; hence, 
we can measure the accuracy. 

For initialization, the model parameters are initialized using all data points and their labels in the
training set. In particular, the number of components $k$ is set to the number of classes $t$; the mean
and standard deviation (for the Gaussian case only) of the data points in each class are used to initialize
$\Lambda$; and $n_h/n$ is used to initialize each dimension of $\alpha$, where $n_h$ is the number of data
points in class $h$ and $n$ is the total number of data points. $\eta$ in DMM is set by cross-validation.
In particular, each $\eta_h$, $[h]_1^{t-1}$, in $\eta$ takes the value $r u_h$, where $u_h$ is a unit vector
with the $h$th dimension being 1 and the others being 0, and $r$ takes values from 0 to 100 in steps of 10.
The value of $r$ that gives the best results on a validation set is used to set $\eta$.
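The grid search over the scale $r$ can be sketched as below. The callback `validation_accuracy` is hypothetical: it stands for whatever routine fits the model with a given $\eta$ and scores it on held-out data, which the chapter does not spell out. The function names are likewise illustrative.

```python
def eta_for_scale(r, t, k):
    """Rows eta_h = r * u_h for h = 1..t-1, where u_h is the h-th unit vector of length k."""
    return [[float(r) if c == h else 0.0 for c in range(k)] for h in range(t - 1)]

def select_scale(t, k, validation_accuracy, scales=range(0, 101, 10)):
    """Return the r in {0, 10, ..., 100} whose eta scores best on validation data.
    validation_accuracy is a hypothetical callback: eta -> accuracy."""
    return max(scales, key=lambda r: validation_accuracy(eta_for_scale(r, t, k)))
```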

¹ See http://archive.ics.uci.edu/ml/.

² See http://akama.arc.nasa.gov/ASRSDBOnline/QueryWizard_Begin.aspx.





The results for DLDA and DMNB are presented in Tables 16.2 and 16.3, respectively. Comparing
DMM models with the corresponding MM counterparts, one can see that while Std DMM
models are not necessarily better than Std MM models, Fast DMM models are almost always better
than Fast MM models. Overall, Fast DMM models achieve the highest accuracy among the four
algorithms. The higher accuracy of Fast DMM demonstrates the effect of logistic regression in
accommodating label information for DMM models.


TABLE 16.2

Accuracy for LDA and DLDA (k = t). Fast DLDA has a higher accuracy on all datasets.

            Nasa      Classic3  Diff      Sim       Same
Std LDA     0.9140    0.6733    0.9677    0.8143    0.5633
            ±0.0140   ±0.0254   ±0.0069   ±0.0161   ±0.0243
Std DLDA    0.9220    0.6710    0.9600    0.8140    0.6267
            ±0.0127   ±0.0256   ±0.0089   ±0.0252   ±0.0348
Fast LDA    0.9194    0.6748    0.9773    0.8553    0.7730
            ±0.0148   ±0.0242   ±0.0110   ±0.0197   ±0.0205
Fast DLDA   0.9237    0.6756    0.9800    0.8653    0.7900
            ±0.0163   ±0.0234   ±0.0102   ±0.0182   ±0.0315


TABLE 16.3

Accuracy for MNB and DMNB (k = t). Fast DMNB has a higher accuracy on most of the datasets.

            Ecoli     Glass     Iono      Seg       Sona      Wdbc      Wine
Std MNB     0.7895    0.6190    0.6829    0.6514    0.6300    0.9321    0.9606
            ±0.0629   ±0.1052   ±0.0579   ±0.0293   ±0.0789   ±0.0351   ±0.0500
Std DMNB    0.7788    0.6048    0.7314    0.6398    0.6102    0.9397    0.9647
            ±0.0554   ±0.1231   ±0.0895   ±0.0397   ±0.0822   ±0.0378   ±0.0411
Fast MNB    0.7950    0.5952    0.7486    0.6333    0.6100    0.9089    0.9470
            ±0.0595   ±0.0645   ±0.0643   ±0.0676   ±0.0516   ±0.0309   ±0.0647
Fast DMNB   0.8152    0.5238    0.8507    0.6701    0.6600    0.9286    0.9765
            ±0.0862   ±0.1209   ±0.0891   ±0.0487   ±0.0876   ±0.0253   ±0.0304


16.7.3 Fast DMM vs. Std DMM 

From Tables 16.2 and 16.3, comparing Fast DMM with Std DMM, one can see that Fast DLDA has 
a higher accuracy than Std DLDA, and Fast DMNB generally also has a higher accuracy than Std 
DMNB, with only one exception. 

One can also compare the running time between Std DMM and Fast DMM. The results for
DLDA and DMNB are presented in Tables 16.4 and 16.5, respectively. In Table 16.5, although most
of the datasets are small, Fast DMNB is already faster than Std DMNB. Fast DMM's advantage
increases when it comes to the larger and higher-dimensional text data as in Table 16.4, where Fast
DLDA is about 20 to 150 times faster than Std DLDA, showing Fast DMM models' significant
superiority in terms of time efficiency. Therefore, Fast DMM models are generally more accurate
and substantially faster than Std DMM models.


TABLE 16.4

Running time (seconds) of Std DLDA and Fast DLDA. Fast DLDA is computationally more efficient
than Std DLDA.

              Nasa      Classic3   Diff       Sim        Same
Dimension     604       5923       7666       10083      5932
Std DLDA      549.17    2176.67    1752.78    2344.64    1981.46
              ±5.74     ±21.62     ±22.36     ±966.50    ±289.24
Fast DLDA     3.63      114.34     27.56      36.10      40.18
              ±0.21     ±18.13     ±0.61      ±2.98      ±5.83
Speedup (×)   151       19         64         65         49


TABLE 16.5

Running time (seconds) of Std DMNB and Fast DMNB. Fast DMNB is computationally more
efficient than Std DMNB.

              Ecoli    Glass    Iono     Seg       Sona     Wdbc     Wine
Dimension     7        9        32       19        60       30       13
Std DMNB      4.65     2.76     5.20     120.26    4.89     3.33     2.26
              ±1.13    ±0.49    ±3.11    ±77.27    ±4.51    ±0.40    ±0.25
Fast DMNB     3.97     2.21     0.82     25.37     1.03     1.91     1.10
              ±0.39    ±0.21    ±0.01    ±6.32     ±0.07    ±0.11    ±0.04
Speedup (×)   1.17     1.25     6.34     4.74      4.75     1.74     2.05


We further investigate the cluster assignments of Fast DMM. The cluster membership of each
data point can be considered as its probability of belonging to different clusters. If one calculates the
Shannon entropy of the cluster membership, a high entropy indicates a real mixed membership
assignment, while a low entropy implies almost a sole membership. Figure 16.4 shows the histograms
of cluster membership entropy for Std DLDA and Fast DLDA on Sim, and for Std DMNB and
Fast DMNB on Glass, where each bar denotes the number of data points falling into that range of
entropy. While most data points from Std DMM have a large entropy over different ranges, the data
points from Fast DMM mostly have a small entropy. This observation indicates that fast
variational inference actually generates somewhat sole membership, while the regular variational
inference generates real mixed membership. This gives one possible explanation for
Fast DMM's better classification performance compared with Std DMM: In a (single-label) classification
scenario, each data point belongs to only one class; hence, the sole membership from Fast DMM would
probably be more appropriate than the mixed membership.
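The entropy histogram described above can be sketched as follows. The choice of natural logarithm, the number of bins, and the treatment of a membership vector as a normalized probability vector are assumptions; the chapter does not state which normalization or base was used. The function names are illustrative.

```python
import math

def membership_entropy(p):
    """Shannon entropy (in nats) of a membership vector; zero entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_histogram(memberships, n_bins=10):
    """Count data points per entropy range, with bins spanning [0, log k] for k >= 2."""
    k = len(memberships[0])
    max_entropy = math.log(k)  # attained by the uniform membership
    counts = [0] * n_bins
    for p in memberships:
        h = membership_entropy(p)
        b = min(int(h / max_entropy * n_bins), n_bins - 1)
        counts[b] += 1
    return counts
```

A sole membership such as `[1, 0, 0]` falls in the first bin, while a uniform membership falls in the last.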

One possible reason for the sole membership from fast variational inference is as follows: In
the E-step, Std DMNB iterates through (16.31) and (16.32) to update $\phi$ and $\gamma$, while Fast DMNB
iterates through (16.53) and (16.54). The expression for $\gamma$ in (16.32) contains the summation of
$\phi_j$ over all features $j$. Since each $\phi_j$ may take different values, in the sense that each
$\phi_j$ may peak at different components, the summation of $\phi_j$ may have several peaks on different
components. Accordingly, $\gamma$ will also have several peaks, leading to a mixed membership over those
peaked components. In comparison, the expression for $\gamma$ in (16.54) has a term of $m\phi$ instead,
so no matter which component $\phi$ peaks at, the peak will be greatly enhanced in $\gamma$, and such
enhancement in $\gamma$ will further increase the "sole membership" nature of $\phi$ through the term
$\exp\bigl(\Psi(\gamma_{ic}) - \Psi(\sum_{l=1}^{k}\gamma_{il})\bigr)$ in (16.53). By iterating through
$\gamma$ and $\phi$, the accumulated enhancement finally leads to almost a sole membership
on the peaked component.

FIGURE 16.4

Histogram of cluster membership entropy on Glass for Std DMNB and Fast DMNB. (a) is for Std
DMNB and (b) is for Fast DMNB. Fast DMNB assigns most data points a "sole" membership, while
Std DMNB assigns most data points a real mixed membership.

16.7.4 Fast DMM vs. Other Classification Algorithms 

Since Fast DMM models have better performance than Std DMM models, one can use Fast DMM
to compare with other classification algorithms. In this chapter, Fast DMNB is compared with the
support vector machine (SVM) (Chang and Lin, 2001), logistic regression (LR), and naive Bayes
classifier (NBC)² on UCI data; and Fast DLDA is compared with SVM, NBC, LR, and a mixture
of the von Mises-Fisher (vMF) model (Banerjee et al., 2005a) on text data. Since DMM is a
combination of logistic regression and mixed membership models, it is also interesting to compare the
results from DMM to the results from applying MM and logistic regression sequentially in two steps.

For Fast DMNB, the number of components $k$ is set to be $(t, t + 5, t + 10)$, and for Fast DLDA,
$k$ is set to be $(t, t + 15, t + 30, t + 50, t + 100)$. The initialization of $\Lambda$ is based on the mean
and standard deviation (for the Gaussian case only) of the training data in given classes plus some
perturbation if $k > t$. $\alpha$ is set to be $1/k$ on each dimension, and $\eta$ is also set by
cross-validation as in Section 16.7.2.

² Note that naive Bayes used in this subsection is the classifier instead of the clustering algorithm.

The results for Fast DLDA and Fast DMNB are presented in Tables 16.6 and 16.7. The top parts of
the tables are the results from the generative models, and the bottom parts are the results from
discriminative classification algorithms. Bold is used for the best results among the generative models,
and bold and italic are used for the best results among all algorithms. Three observations can be
drawn from the tables:

1. Overall, on text datasets, Fast DLDA does better than all other algorithms, including SVM, on
almost all datasets, which is a promising result, although more rigorous experiments are needed
for further investigation; on UCI datasets, Fast DMNB also achieves higher accuracy than other
algorithms on most of the datasets except SVM, which outperforms Fast DMNB five out of nine
times.

2. The better performance of Fast DMM models compared to LR on original datasets indicates
that the low-dimensional representation from Fast DMM helps the classification.

3. Interestingly, for Fast DMNB, the accuracy increases monotonically with $k$ from $t$ to $t + 10$ on
most of the datasets. For Fast DLDA on text data, an increase of accuracy with a larger $k$ is also
observed, although the result goes up and down without a clear trend. One possible reason for
the increasing accuracy is as follows: When $k$ is too small, the model performs a drastic dimension
reduction to represent each data point in a $k$-dimensional mixed membership representation,
which may cause a huge loss of information, but the loss may decrease when $k$ increases.


TABLE 16.6

Accuracy of Fast DLDA and other classification algorithms on text data. Fast DLDA has higher
accuracy on most datasets.

                        Nasa      Classic3  Diff      Sim       Same
Fast DLDA (k = t)       0.9237    0.6756    0.9800    0.8653    0.7900
                        ±0.0163   ±0.0234   ±0.0102   ±0.0182   ±0.0315
Fast DLDA (k = t+15)    0.9232    0.6858    0.9747    0.8713    0.8458
                        ±0.0144   ±0.0216   ±0.0121   ±0.0264   ±0.0214
Fast DLDA (k = t+30)    0.9301    0.6838    0.9817    0.8707    0.8468
                        ±0.0128   ±0.0234   ±0.0099   ±0.0228   ±0.0190
Fast DLDA (k = t+50)    0.9237    0.6854    0.9823    0.8700    0.8150
                        ±0.0138   ±0.0211   ±0.0083   ±0.0230   ±0.0184
Fast DLDA (k = t+100)   0.9261    0.6866    0.9760    0.8718    0.8347
                        ±0.0102   ±0.0245   ±0.0108   ±0.0182   ±0.0187
vMF                     0.9216    0.6509    0.95301   0.7447    0.7600
                        ±0.0113   ±0.0246   ±0.0071   ±0.0214   ±0.0347
NBC                     0.9334    0.6766    0.9813    0.8613    0.8410
                        ±0.0094   ±0.0230   ±0.0069   ±0.0216   ±0.0262
LR                      0.9209    0.6396    0.9553    0.6750    0.4823
                        ±0.0157   ±0.0252   ±0.0157   ±0.1330   ±0.1283
SVM                     0.9192    0.6854    0.9563    0.8357    0.8120
                        ±0.0146   ±0.0278   ±0.0105   ±0.0156   ±0.203


TABLE 16.7

Accuracy of Fast DMNB and other classification algorithms on UCI data. Fast DMNB has a higher
accuracy, except for SVM.

                       Ecoli     Glass     Iono      Seg       Sona      Wdbc      Wine
Fast DMNB (k = t)      0.8152    0.5238    0.8507    0.6701    0.6600    0.9286    0.9765
                       ±0.0862   ±0.1209   ±0.0891   ±0.0487   ±0.0876   ±0.0253   ±0.0304
Fast DMNB (k = t+5)    0.8392    0.5248    0.8543    0.7632    0.8100    0.9393    0.9882
                       ±0.0836   ±0.0643   ±0.0908   ±0.0412   ±0.0907   ±0.0388   ±0.0284
Fast DMNB (k = t+10)   0.8485    0.5667    0.8943    0.7684    0.8200    0.9375    0.9765
                       ±0.0515   ±0.1015   ±0.0786   ±0.0418   ±0.1509   ±0.0329   ±0.0411
NBC                    0.8363    0.4333    0.8114    0.6850    0.7268    0.9339    0.9705
                       ±0.0745   ±0.1318   ±0.0853   ±0.0625   ±0.0079   ±0.0266   ±0.0310
LR                     0.8030    0.5109    0.8400    0.8307    0.7500    0.9429    0.7471
                       ±0.0610   ±0.1234   ±0.0276   ±0.0358   ±0.0816   ±0.0250   ±0.1469
SVM                    0.8349    0.4676    0.9171    0.9745    0.7450    0.9536    0.9765
                       ±0.0670   ±0.0875   ±0.0594   ±0.0096   ±0.0896   ±0.0173   ±0.0304


Fast DMM models perform dimensionality reduction and classification in one step via a combination
of Fast MM and logistic regression. In principle, one can use these two algorithms sequentially in
two steps, i.e., first use Fast MM models to get a low-dimensional representation, and then apply
logistic regression on the low-dimensional representation for classification. The results for these
two strategies are presented in Figure 16.5. It is clear that Fast DMM models outperform the Fast
MM+LR strategy. Therefore, by combining Fast MM and logistic regression, Fast DMM
achieves supervised dimensionality reduction and obtains a better low-dimensional representation than
Fast MM, which helps classification.



FIGURE 16.5

Comparison between using Fast MM+LR and Fast DMM. (a) is for Fast DLDA on text data, and
(b) is for Fast DMNB on UCI data. Fast DMM achieves higher accuracy, indicating the advantage
of supervised dimension reduction.


16.7.5 Topics from Fast DLDA

As mentioned before, DMM models generate interpretable results. An example of several topic word
lists on Nasa generated by Fast DLDA ($k = t + 30$) is given in Table 16.8. It also demonstrates
the effect of allowing a larger number of components than the number of
classes ($k > t$); that is, Fast DLDA may discover topics that are not explicitly specified in the class
labels, while maintaining the predefined number of classes. The first three topics in Table 16.8
correspond to the three classes in Nasa, respectively, but Topic 4, which we call "passenger medical
emergency," can be considered a subcategory of the "passenger" class, and it is not specified in
the labels. Neither NBC nor SVM is able to generate this type of result.


TABLE 16.8

Extracted topics from the Nasa dataset using Fast DLDA.

Topic 1      Topic 2                  Topic 3      Topic 4
runway       maintenance              passenger    passenger
aircraft     aircraft                 flight       flight
approach     flight                   attendant    medical
tower        minimum equipment list   told         attendant
cleared      time                     captain      emergency
landing      check                    seat         aircraft
airport      engine                   asked        doctor
turn         mechanical               back         landing
taxi         installed                attendants   attendants
traffic      part                     aircraft     captain
final        inspection               lavatory     oxygen
controller   work                     crew         paramedics


16.8 Conclusion 

In this chapter, we have discussed discriminative mixed membership models as a combination of
unsupervised mixed membership models and multi-class logistic regression. We introduced a fast
variational inference algorithm that is substantially faster than the mean field approximation used
in LDA (Blei et al., 2003) and leads to even better classification performance. An important property
of DMM models is that they allow the number of components $k$ to be different from the number
of classes $t$. Interestingly, a larger $k$ helps to discover components not specified in the labels and
increases classification accuracy. In addition, DMM models are competitive with state-of-the-art
classification algorithms in terms of accuracy, especially on text data, and are able to generate
interpretable results.

Discrete mixed membership modeling and continuous latent factor modeling (also known as matrix
factorization) are two popular, complementary approaches to dyadic data analysis. In this chapter,
we develop a fully Bayesian framework for integrating the two approaches into unified Mixed
Membership Matrix Factorization (M³F) models. We introduce two M³F models, derive Gibbs sampling
inference procedures, and validate our methods on the EachMovie, MovieLens, and Netflix Prize
collaborative filtering datasets. We find that even when fitting fewer parameters, the M³F models
outperform state-of-the-art latent factor approaches on all benchmarks, yielding the greatest gains
in accuracy on sparsely-rated, high-variance items.

17.1 Introduction

This chapter is concerned with unifying discrete mixed membership modeling and continuous latent
factor modeling for probabilistic dyadic data prediction. The ideas contained herein are based on
the work of Mackey et al. (2010). In the dyadic data prediction (DDP) problem (Hofmann et al.,
1998), we observe labeled dyads, i.e., ordered pairs of objects, and form predictions for the labels
of unseen dyads. For example, in the collaborative filtering setting we observe $U$ users, $M$ items,
and a training set $T = \{(u_n, j_n, r_{u_n j_n})\}_{n=1}^{N}$ with real-valued ratings $r_{uj}$ representing
the preferences of certain users $u_n$ for certain items $j_n$. The goal is then to predict unobserved
ratings based on users' past preferences. Other concrete examples of DDP include link prediction in
social network analysis, binding affinity prediction in bioinformatics, and click prediction in web searches.

Matrix factorization methods (Rennie and Srebro, 2005; DeCoste, 2006; Salakhutdinov and
Mnih, 2007; 2008; Takacs et al., 2009; Koren et al., 2009; Lawrence and Urtasun, 2009) represent
the state of the art for dyadic data prediction tasks. These methods view a dyadic dataset as a
sparsely observed ratings matrix, $R \in \mathbb{R}^{U \times M}$, and learn a constrained decomposition of
that matrix as a product of two latent factor matrices: $R \approx A^{\top} B$ for
$A \in \mathbb{R}^{D \times U}$, $B \in \mathbb{R}^{D \times M}$, and $D$ small.
While latent factor methods perform remarkably well on the DDP task, they fail to capture the
heterogeneous nature of objects and their interactions. Such models, for instance, do not account
for the fact that a user's ratings are influenced by instantaneous mood, that protein interactions are
affected by transient functional contexts, or even that users with distinct behaviors may be sharing
a single account or web browser.

The fundamental limitation of continuous latent factor methods is a result of the static way in 
which ratings are assumed to be produced: a user generates all of his item ratings using the same 
factor vector without regard for context. Discrete mixed membership models, such as latent Dirich- 
let allocation (Blei et al., 2003), were developed to address a similar limitation of mixture models. 
Whereas mixture models assume that each generated object is underlyingly a member of a single 
latent topic, mixed membership models represent objects as distributions over topics. Mixed mem- 
bership dyadic data models such as the mixed membership stochastic blockmodel (Airoldi et al., 
2008) for relational prediction and Bi-LDA (Porteous et al., 2008) for rating prediction introduce 
context dependence by allowing each object to select a new topic for each new interaction. However, 
the relatively poor predictive performance of Bi-LDA suggests that the blockmodel assumption — 
that objects only interact via their topics — is too restrictive. 

In this chapter we develop a fully Bayesian framework for wedding the strong performance and
expressiveness of continuous latent factor models with the context dependence and topic clustering
of discrete mixed membership models. In Section 17.2, we provide additional background on
matrix factorization and mixed membership modeling. We introduce our Mixed Membership Matrix
Factorization (M³F) framework in Section 17.3, and discuss procedures for inference and prediction
under two specific M³F models in Section 17.4. Section 17.5 describes experimental evaluation and
analysis of our models on a variety of real-world collaborative filtering datasets. The results
demonstrate that mixed membership matrix factorization methods outperform their context-blind
counterparts and simultaneously reveal interesting clustering structure in the data. Finally, we present a
conclusion in Section 17.6.

17.2 Background

17.2.1 Latent Factor Models

We begin by considering a prototypical latent factor model, Bayesian Probabilistic Matrix
Factorization (BPMF) of Salakhutdinov and Mnih (2008) (see Figure 17.1). Like most factor models,
BPMF associates with each user $u$ an unknown factor vector $a_u \in \mathbb{R}^D$ and with each item $j$ an
unknown factor vector $b_j \in \mathbb{R}^D$. A user generates a rating for an item by adding Gaussian noise to
the inner product, $r_{uj} = a_u \cdot b_j$. We refer to this inner product as the static rating for a user-item
pair, because, as discussed in the introduction, the latent factor rating mechanism does not model
the context in which a rating is given and does not allow a user to don different moods or "hats"
in different dyadic interactions. Such contextual flexibility is desirable for capturing the
context-sensitive nature of dyadic interactions, and, therefore, we turn our attention to mixed membership
models.


FIGURE 17.1

Graphical model representations of BPMF (top left), Bi-LDA (bottom left), and M³F-TIB (right).


17.2.2 Mixed Membership Models 

Two recent examples of dyadic mixed membership (DMM) models are the mixed membership
stochastic blockmodel (MMSB) (Airoldi et al., 2008) and Bi-LDA (Porteous et al., 2008) (see
Figure 17.1). In DMM models, each user $u$ and item $j$ has its own discrete distribution over topics,
represented by topic parameters $\theta_u^U$ and $\theta_j^M$. When a user desires to rate an item, both the
user and the item select interaction-specific topics according to their distributions; the selected
topics then determine the distribution over ratings.

One drawback of DMM models is the reliance on purely group-wise interactions: one learns
how a user group interacts with an item group but not how a user group interacts directly with
a particular item. M³F models address this limitation in two ways: first, by modeling interactions
between groups and specific users or items and, second, by incorporating the user-item specific static
rating of latent factor models.

17.3 Mixed Membership Matrix Factorization

In this section, we present a general mixed membership matrix factorization framework and two
specific models that leverage the predictive power and static specificity of continuous latent factor
models while allowing for the clustered context-sensitivity of mixed membership models. In each
M³F model, users and items are endowed both with latent factor vectors ($a_u$ and $b_j$) and with
topic distribution parameters ($\theta_u^U$ and $\theta_j^M$). To rate an item, a user first draws a topic
$z_{uj}^U$ from his distribution, representing, for example, his mood at the time of rating (in the mood
for romance vs. comedy), and the item draws a topic $z_{uj}^M$ from its distribution, representing, for
example, the context under which it is being rated (in a theater on opening night vs. in a high
school classroom). The selected user and item topics, $z_{uj}^U = i$ and $z_{uj}^M = k$, together with the
identity of the user and item, $u$ and $j$, jointly specify a rating bias, $\beta_{uj}^{ik}$, tailored to the
user-item pair. Different M³F models will differ principally in the precise form of this contextual
bias. To generate a complete rating, the user-item-specific static rating $a_u \cdot b_j$ is added to the
contextual bias $\beta_{uj}^{ik}$, along with some noise. Rather than learn point estimates under our M³F
models, we adopt a fully Bayesian methodology and place priors on all parameters of interest.
Topic distribution parameters $\theta_u^U$ and $\theta_j^M$ are given independent exchangeable Dirichlet priors,
and the latent factor vectors $a_u$ and $b_j$ are drawn independently from
$\mathcal{N}(\mu^U, (\Lambda^U)^{-1})$ and $\mathcal{N}(\mu^M, (\Lambda^M)^{-1})$, respectively. As in Salakhutdinov and Mnih
(2008), we place normal-Wishart priors on the hyperparameters $(\mu^U, \Lambda^U)$ and $(\mu^M, \Lambda^M)$.
Suppose $K^U$ is the number of user topics and $K^M$ is the number of item topics. Then, given the
contextual biases $\beta_{uj}^{ik}$, ratings are generated according to the following M³F generative process:

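$\Lambda^U \sim \mathrm{Wishart}(W_0, \nu_0)$, $\Lambda^M \sim \mathrm{Wishart}(W_0, \nu_0)$

$\mu^U \sim \mathcal{N}(\mu_0, (\lambda_0 \Lambda^U)^{-1})$, $\mu^M \sim \mathcal{N}(\mu_0, (\lambda_0 \Lambda^M)^{-1})$.

For each $u \in \{1, \ldots, U\}$:

$a_u \sim \mathcal{N}(\mu^U, (\Lambda^U)^{-1})$

$\theta_u^U \sim \mathrm{Dir}(\alpha / K^U)$.

For each $j \in \{1, \ldots, M\}$:

$b_j \sim \mathcal{N}(\mu^M, (\Lambda^M)^{-1})$

$\theta_j^M \sim \mathrm{Dir}(\alpha / K^M)$.

For each rating $r_{uj}$:

$z_{uj}^U \sim \mathrm{Multi}(1, \theta_u^U)$, $z_{uj}^M \sim \mathrm{Multi}(1, \theta_j^M)$

$r_{uj} \mid z_{uj}^U = i, z_{uj}^M = k \sim \mathcal{N}(\beta_{uj}^{ik} + a_u \cdot b_j, \sigma^2)$.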

For each of the following models discussed, we let $\Theta^U$ denote the collection of all user
parameters (e.g., $a$, $\theta^U$, $\Lambda^U$, $\mu^U$), $\Theta^M$ denote all item parameters, and $\Theta_0$ denote all global
parameters (e.g., $W_0$, $\nu_0$, $\mu_0$, $\lambda_0$, $\alpha$, $\sigma_0^2$, $\sigma^2$). We now describe in more detail the specific
forms of two M³F models and their contextual biases.

17.3.1 The M³F Topic-Indexed Bias Model

The M³F Topic-Indexed Bias (TIB) model assumes that the contextual bias decomposes into a latent
user bias and a latent item bias. The user bias is influenced by the interaction-specific topic selected
by the item. Similarly, the item bias is influenced by the user's selected topic. We denote the latent
rating bias of user $u$ under item topic $k$ as $c_u^k$ and denote the bias for item $j$ under user topic $i$
as $d_j^i$. The contextual bias for a given user-item interaction is then found by summing the two
latent biases and a fixed global bias, $\chi_0$: 1

$\beta_{uj}^{ik} = \chi_0 + c_u^k + d_j^i$.

Topic-indexed biases $c_u^k$ and $d_j^i$ are drawn independently from Gaussian priors with variance
$\sigma_0^2$ and means $c_0$ and $d_0$, respectively. Figure 17.1 compares the graphical model representations
of M³F-TIB, BPMF, and Bi-LDA. Note that M³F-TIB reduces to BPMF when $K^U$ and $K^M$ are both
zero. Intuitively, the topic-indexed bias model captures the "Napoleon Dynamite effect," whereby
certain movies provoke strongly differing reactions from otherwise similar users (Thompson, 2008).
Each user-topic-indexed bias $d_j^i$ represents one of $K^U$ possible predispositions towards liking or
disliking each item in the database, irrespective of the static latent factor parameterization. Thus,
in the movie-recommendation problem, we expect the variance in user reactions to movies such as
Napoleon Dynamite to be captured in part by a corresponding variance in the bias parameters $d_j^i$
(see Section 17.5). Moreover, because the model is symmetric, each rating is also influenced by
the item-topic-indexed bias $c_u^k$. This can be interpreted as the predisposition of each perceived item
class towards being liked or disliked by each user in the database. Finally, because M³F-TIB is a
mixed membership model, each user and item can choose a different topic and hence a different
bias for each rating (e.g., when multiple users share a single account).
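To make the generative process concrete, the following is a minimal sketch that draws a small synthetic ratings matrix from a simplified M³F-TIB model. It is an illustration under stated assumptions rather than the implementation used in our experiments: the function name `sample_m3f_tib` and all default values are hypothetical, and the static factors are given a fixed standard normal prior instead of hyperparameters drawn from the normal-Wishart prior.

```python
import numpy as np

def sample_m3f_tib(U=5, M=4, D=3, K_U=2, K_M=2, alpha=1.0, chi0=3.0,
                   c0=0.0, d0=0.0, sigma0_sq=0.1, sigma_sq=0.5, seed=0):
    """Draw one U x M ratings matrix from a simplified M3F-TIB process."""
    rng = np.random.default_rng(seed)
    # Assumption: fixed N(0, I) prior on static factors (no normal-Wishart draw).
    a = rng.standard_normal((U, D))
    b = rng.standard_normal((M, D))
    # Exchangeable Dirichlet priors over user and item topic distributions.
    theta_U = rng.dirichlet(np.full(K_U, alpha / K_U), size=U)
    theta_M = rng.dirichlet(np.full(K_M, alpha / K_M), size=M)
    # Topic-indexed biases: c[u, k] for item topic k, d[j, i] for user topic i.
    c = rng.normal(c0, np.sqrt(sigma0_sq), size=(U, K_M))
    d = rng.normal(d0, np.sqrt(sigma0_sq), size=(M, K_U))
    r = np.empty((U, M))
    for u in range(U):
        for j in range(M):
            i = rng.choice(K_U, p=theta_U[u])  # user topic z^U_uj
            k = rng.choice(K_M, p=theta_M[j])  # item topic z^M_uj
            beta = chi0 + c[u, k] + d[j, i]    # contextual bias beta^{ik}_{uj}
            r[u, j] = rng.normal(beta + a[u] @ b[j], np.sqrt(sigma_sq))
    return r
```

Because a fresh pair of topics is drawn for every rating, the same user can contribute different biases to different ratings, which is the mixed membership behavior described above.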


17.3.2 The M³F Topic-Indexed Factor Model

The M³F Topic-Indexed Factor (TIF) model assumes that the joint contextual bias is an inner
product of topic-indexed factor vectors, rather than the sum of topic-indexed biases as in the TIB
model. Each item topic $k$ maintains a latent factor vector $c_u^k \in \mathbb{R}^{\tilde{D}}$ for each user, and each user
topic $i$ maintains a latent factor vector $d_j^i \in \mathbb{R}^{\tilde{D}}$ for each item. Each user and each item
additionally maintains a single static rating bias, $\xi_u$ and $\chi_j$, respectively. The joint contextual bias
is formed by summing the user bias, the item bias, and the inner product between the topic-indexed
factor vectors:

$\beta_{uj}^{ik} = \xi_u + \chi_j + c_u^k \cdot d_j^i$.

The topic-indexed factors $c_u^k$ and $d_j^i$ are drawn independently from
$\mathcal{N}(\tilde{\mu}^U, (\tilde{\Lambda}^U)^{-1})$ and $\mathcal{N}(\tilde{\mu}^M, (\tilde{\Lambda}^M)^{-1})$ priors, and conjugate normal-Wishart priors are
placed on the hyperparameters $(\tilde{\mu}^U, \tilde{\Lambda}^U)$ and $(\tilde{\mu}^M, \tilde{\Lambda}^M)$. The static user and item biases, $\xi_u$
and $\chi_j$, are drawn independently from Gaussian priors with variance $\sigma_0^2$ and means $\xi_0$ and $\chi_0$,
respectively. 2

Intuitively, the topic-indexed factor model can be interpreted as an extended matrix factorization
with both global and local low-dimensional representations. Each user $u$ has a single global factor
$a_u$ but $K^M$ local factors $c_u^k$; similarly, each item $j$ has both a global factor $b_j$ and multiple local
factors $d_j^i$. A strength of latent factor methods is their ability to discover globally predictive
intrinsic properties of users and items. The topic-indexed factor model extends this representation
to allow for intrinsic properties that are predictive in some but perhaps not all contexts. For
example, in the movie-recommendation setting, is Lost In Translation a dark comedy or a romance
film? The answer may vary from user to user and thus may be captured by different vectors $d_j^i$ for
each user-indexed topic.
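The TIF contextual bias can be sketched in a few lines. The function name `tif_bias` and the array layout (one row per topic) are assumptions made for illustration only.

```python
import numpy as np

def tif_bias(xi_u, chi_j, c_u, d_j, k, i):
    """Contextual bias beta^{ik}_{uj} = xi_u + chi_j + c_u^k . d_j^i.

    c_u has shape (K_M, D_tilde): one local user factor per item topic.
    d_j has shape (K_U, D_tilde): one local item factor per user topic.
    """
    return xi_u + chi_j + float(np.dot(c_u[k], d_j[i]))

# Hypothetical values: two item topics, two user topics, D_tilde = 2.
c_u = np.array([[1.0, 0.0], [0.0, 2.0]])
d_j = np.array([[1.0, 1.0], [3.0, -1.0]])
bias = tif_bias(xi_u=0.1, chi_j=-0.2, c_u=c_u, d_j=d_j, k=1, i=0)
```

Changing either selected topic swaps in a different local factor, so the same user-item pair can receive different biases in different contexts.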


1 The global bias, $\chi_0$, is suppressed in the remainder of the paper for clarity.
2 Static biases $\xi$ and $\chi$ are suppressed in the remainder of the paper for clarity.





Handbook of Mixed Membership Models and Its Applications 


17.4 Inference and Prediction 

The goal in dyadic data prediction is to predict unobserved ratings $r^{(h)}$ given observed ratings
$r^{(v)}$. As in Salakhutdinov and Mnih (2007; 2008) and Takacs et al. (2009), we adopt root mean
squared error (RMSE) 3 as our primary error metric and note that the Bayes optimal prediction under
RMSE loss is the posterior mean of the predictive distribution $p(r^{(h)} \mid r^{(v)}, \Theta_0)$.
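For reference, RMSE over a set of held-out ratings can be computed as in the following minimal sketch; the helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean squared error between predicted and observed ratings."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))
```

Since squared error is minimized in expectation by the mean, predicting the posterior mean rating is the natural choice under this metric.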

In our M³F models, the predictive distribution over unobserved ratings is found by integrating
out all topics and parameters. The posterior distribution $p(z^U, z^M, \Theta^U, \Theta^M \mid r^{(v)}, \Theta_0)$ is thus
our main inferential quantity of interest. Unfortunately, as in both LDA and BPMF, analytical
computation of this posterior is infeasible due to complex coupling in the marginal distribution
$p(r^{(v)} \mid \Theta_0)$ (Blei et al., 2003; Salakhutdinov and Mnih, 2008).

17.4.1 Inference via Gibbs Sampling 

In this work, we use a Gibbs sampling MCMC procedure (Geman and Geman, 1984) to draw
samples of topic and parameter variables $\{(z^{U(t)}, z^{M(t)}, \Theta^{U(t)}, \Theta^{M(t)})\}_{t=1}^{T}$ from their joint posterior.
Our use of conjugate priors ensures that each Gibbs conditional has a simple closed form.

Algorithm 1 displays the Gibbs sampling algorithm for the M³F-TIB model; the M³F-TIF Gibbs
sampler is similar. The exact conditional distributions of both models are presented in the Appendix.
Note that we choose to sample the topic parameters $\theta^U$ and $\theta^M$ rather than integrate them out as
in a collapsed Gibbs sampler (see, e.g., Porteous et al., 2008). This decision allows us to sample
the interaction-specific topic variables in parallel. Indeed, each loop in Algorithm 1 corresponds to
a block of parameters that can be sampled in parallel. In practice, such parallel computation yields
substantial savings in sampling time for large-scale dyadic datasets.
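The benefit of conditioning on the topic parameters can be illustrated with a vectorized draw of all user topic variables at once under the M³F-TIB model. This is a sketch only: the function name `sample_user_topics` and the per-rating array layout are assumptions, and the categorical draws use the Gumbel-max trick as a convenient vectorization device, not necessarily the scheme used in our implementation.

```python
import numpy as np

def sample_user_topics(resid, d_rows, theta_rows, sigma_sq, rng):
    """Sample z^U for N observed ratings in one vectorized step.

    resid[n]         : r_uj - chi_0 - c_u^{z^M_uj} - a_u . b_j for rating n
    d_rows[n, i]     : d_j^i for the item of rating n
    theta_rows[n, i] : theta^U_{ui} for the user of rating n
    """
    resid = np.asarray(resid, dtype=float)[:, None]
    # Unnormalized log-probabilities of each user topic for each rating.
    log_p = np.log(theta_rows) - (resid - d_rows) ** 2 / (2.0 * sigma_sq)
    # Gumbel-max trick: argmax(log p + Gumbel noise) is a categorical draw
    # with probabilities proportional to p, taken independently per row.
    return np.argmax(log_p + rng.gumbel(size=log_p.shape), axis=1)
```

Given the current parameters, the rows are conditionally independent, so the same computation can be split across workers without changing the sampler.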

17.4.2 Prediction 

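Given posterior samples of parameters, we can approximate the true predictive distribution by the
Monte Carlo expectation

$p(r^{(h)} \mid r^{(v)}, \Theta_0) \approx \frac{1}{T} \sum_{t=1}^{T} \sum_{z^U, z^M} p(z^U, z^M \mid \Theta^{U(t)}, \Theta^{M(t)}) \, p(r^{(h)} \mid z^U, z^M, \Theta^{U(t)}, \Theta^{M(t)}, \Theta_0)$,    (17.1)

where we have integrated over the unknown topic variables. Equation (17.1) yields the following
posterior mean prediction for each user-item pair under the M³F-TIB model:

$\frac{1}{T} \sum_{t=1}^{T} \left( a_u^{(t)} \cdot b_j^{(t)} + \sum_{k=1}^{K^M} c_u^{k(t)} \theta_{jk}^{M(t)} + \sum_{i=1}^{K^U} d_j^{i(t)} \theta_{ui}^{U(t)} \right)$.

Under the M³F-TIF model, posterior mean prediction takes the form

$\frac{1}{T} \sum_{t=1}^{T} \left( a_u^{(t)} \cdot b_j^{(t)} + \sum_{i=1}^{K^U} \sum_{k=1}^{K^M} \theta_{ui}^{U(t)} \theta_{jk}^{M(t)} \, c_u^{k(t)} \cdot d_j^{i(t)} \right)$.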
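The M³F-TIB posterior mean prediction amounts to averaging a per-sample conditional mean over the retained Gibbs samples. The sketch below assumes each sample is stored as a dictionary of arrays with the hypothetical keys `a`, `b`, `c`, `d`, `theta_U`, and `theta_M`; the global bias enters through the optional `chi0` argument.

```python
import numpy as np

def predict_tib(samples, u, j, chi0=0.0):
    """Monte Carlo posterior mean rating for user u and item j under M3F-TIB."""
    preds = []
    for s in samples:
        static = s["a"][u] @ s["b"][j]           # a_u . b_j
        user_bias = s["c"][u] @ s["theta_M"][j]  # sum_k c_u^k theta^M_jk
        item_bias = s["d"][j] @ s["theta_U"][u]  # sum_i d_j^i theta^U_ui
        preds.append(chi0 + static + user_bias + item_bias)
    return float(np.mean(preds))

# One hypothetical sample with a single user and a single item.
sample = {
    "a": np.array([[1.0, 2.0]]), "b": np.array([[3.0, 4.0]]),
    "c": np.array([[0.5, -0.5]]), "theta_M": np.array([[0.25, 0.75]]),
    "d": np.array([[1.0, 2.0]]), "theta_U": np.array([[0.5, 0.5]]),
}
prediction = predict_tib([sample], u=0, j=0, chi0=3.0)
```

The topic variables never need to be sampled at prediction time, because the expectation over topics is available in closed form through the topic distributions.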


3 For work linking improved RMSE with better top-K recommendation rankings, see Koren (2008). 
Algorithm 1: Gibbs Sampling for M³F-TIB.


17.5 Experimental Evaluation 

We evaluate our models on several movie rating collaborative filtering datasets, including the Netflix
Prize dataset, 4 the EachMovie dataset, and the 1M and 10M MovieLens datasets. 5 The Netflix Prize
dataset contains 100 million ratings in {1, ..., 5} distributed across 17,770 movies and 480,189
users. The EachMovie dataset contains 2.8 million ratings in {1, ..., 6} distributed across 1,648
movies and 74,424 users. The 1M MovieLens dataset has 6,040 users, 3,952 movies, and 1 million
ratings in {1, ..., 5}. The 10M MovieLens dataset has 10,681 movies, 71,567 users, and 10
million ratings on a 0.5 to 5 scale with half-star increments. In all experiments, we set $W_0$ equal
to the identity matrix, $\nu_0$ equal to the number of static matrix factors, $\mu_0$ equal to the all-zeros
vector, $\chi_0$ equal to the mean rating in the dataset, and $(\lambda_0, \sigma^2, \sigma_0^2) = (10, 0.5, 0.1)$. For M³F-TIB
experiments, we set $(c_0, d_0, \alpha) = (0, 0, 10000)$, and for M³F-TIF, we set $\tilde{W}_0$ equal to the identity
matrix, $\tilde{\nu}_0$ equal to the number of topic-indexed factors, $\tilde{\mu}_0$ equal to the all-zeros vector, and
$(\tilde{D}, \xi_0, \alpha, \tilde{\lambda}_0) = (2, 0, 10, 10000)$. Free parameters were selected by grid search on an EachMovie
hold-out set, disjoint from the test sets used for evaluation. Throughout, reported error intervals are
of plus or minus one standard error from the mean.

4 See http://www.netflixprize.com/.

5 See http://www.grouplens.org/.

17.5.1 1M MovieLens and EachMovie Datasets 

We first evaluated our models on the smaller datasets, 1M MovieLens and EachMovie. We
conducted the "weak generalization" ratings prediction experiment of Marlin (2004), where, for each
user in the training set, a single rating was withheld for the test set. All reported results were
averaged over the same three random train-test splits used in Marlin (2003), Marlin (2004), Rennie
and Srebro (2005), DeCoste (2006), Park and Pennock (2007), and Lawrence and Urtasun (2009).
Our Gibbs samplers were initialized with draws from the prior and ran for 3000 samples for
M³F-TIB and 512 samples for M³F-TIF. No samples were discarded for "burn-in."
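A weak generalization split of the kind described above can be sketched as follows. This is not the standardized split used in the cited work; the function name `weak_generalization_split` and the flat array of per-rating user identifiers are assumptions for illustration.

```python
import numpy as np

def weak_generalization_split(user_ids, seed=0):
    """Return a boolean test mask that withholds one random rating per user.

    user_ids[n] is the user who produced rating n. A user with a single
    rating would be left with no training data, so such users are assumed
    to have been filtered out beforehand.
    """
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    test_mask = np.zeros(len(user_ids), dtype=bool)
    for u in np.unique(user_ids):
        idx = np.flatnonzero(user_ids == u)  # positions of this user's ratings
        test_mask[rng.choice(idx)] = True    # withhold exactly one of them
    return test_mask
```

Ratings where the mask is false form the training set, and every test user is guaranteed to appear in it.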

Table 17.1 reports the predictive performance of our models for a variety of static factor
dimensionalities ($D$) and topic counts ($K^U$, $K^M$). We compared all models against BPMF as a baseline
by running the M³F-TIB model with $K^U$ and $K^M$ set to zero. For comparison with previous
results that report the normalized mean absolute error (NMAE) of Marlin (2004), we additionally
ran M³F-TIB with $(D, K^U, K^M) = (300, 2, 1)$ on EachMovie and achieved a weak RMSE of
(1.0878 ± 0.0025) and a weak NMAE of (0.4293 ± 0.0013).

On both the EachMovie and the 1M MovieLens datasets, both M³F models systematically
outperformed the BPMF baseline for almost every setting of latent dimensionality and topic counts.
For $D = 20$, increasing $K^U$ to 2 provided a boost in accuracy for both M³F models equivalent to
doubling the number of BPMF static factor parameters ($D = 40$). We also found that the M³F-TIB
model outperformed the more recent Gaussian process matrix factorization model of Lawrence and
Urtasun (2009).

The results indicate that the mixed membership component of M³F offers greater predictive
power than simply increasing the dimensionality of a pure latent factor model. While the M³F-TIF
model sometimes failed to outperform the BPMF baseline due to overfitting, the M³F-TIB model
always outperformed BPMF regardless of the setting of $K^U$, $K^M$, or $D$. Note that the increase in
the number of parameters from the BPMF model to the M³F models is independent of $D$ (M³F-TIB
requires $(U + M)(K^U + K^M)$ more parameters than BPMF with equal $D$), and therefore the
ratio of the number of parameters of BPMF and M³F approaches 1 if $D$ increases while $K^U$, $K^M$,
and $\tilde{D}$ are held fixed. Nonetheless, the modeling of joint contextual bias in the M³F-TIB model
continues to improve predictive performance even as $D$ increases, suggesting that the M³F-TIB
model is capturing aspects of the data that are not captured by a pure latent factor model.

TABLE 17.1

1M MovieLens and EachMovie RMSE scores for varying static factor dimensionalities and topic
counts for both M³F models. All scores are averaged across 3 standardized cross-validation splits.
Parentheses indicate topic counts ($K^U$, $K^M$). For M³F-TIF, $\tilde{D} = 2$ throughout. L&U (2009) refers
to Lawrence and Urtasun (2009). Best results for each $D$ are boldened. Asterisks indicate significant
improvement over BPMF under a one-tailed, paired t-test with level 0.05.

                   1M MovieLens                          EachMovie
Method             D=10     D=20     D=30     D=40       D=10     D=20     D=30     D=40
BPMF               0.8695   0.8622   0.8621   0.8609     1.1229   1.1212   1.1203   1.1163
M³F-TIB (1,1)      0.8671   0.8614   0.8616   0.8605     1.1205   1.1188   1.1183   1.1168
M³F-TIF (1,2)      0.8664   0.8629   0.8622   0.8616     1.1351   1.1179   1.1095   1.1072
M³F-TIF (2,1)      0.8674   0.8605   0.8605   0.8595     1.1366   1.1161   1.1088   1.1058
M³F-TIF (2,2)      0.8642   0.8584*  0.8584   0.8592     1.1211   1.1043   1.1035   1.1020
M³F-TIB (1,2)      0.8669   0.8611   0.8604   0.8603     1.1217   1.1081   1.1016   1.0978
M³F-TIB (2,1)      0.8649   0.8593   0.8581*  0.8577*    1.1186   1.1004   1.0952   1.0936
M³F-TIB (2,2)      0.8658   0.8609   0.8605   0.8599     1.1101*  1.0961*  1.0918*  1.0905*
L&U (2009)         0.8801 (RBF)   0.8791 (Linear)        1.1111 (RBF)   1.0981 (Linear)
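The parameter-count argument can be checked numerically. The sketch below follows the count stated above, $(U + M)(K^U + K^M)$ extra parameters for M³F-TIB, and approximates the BPMF count by its $(U + M) D$ static factor entries; the helper name `param_counts` is hypothetical, and the Netflix Prize sizes are used only as an example.

```python
def param_counts(U, M, D, K_U, K_M):
    """Return (BPMF factor count, extra M3F-TIB count, M3F-TIB / BPMF ratio)."""
    bpmf = (U + M) * D             # static factor entries for users and items
    extra = (U + M) * (K_U + K_M)  # topic-indexed biases and topic distributions
    return bpmf, extra, (bpmf + extra) / bpmf

# Netflix Prize sizes with (K_U, K_M) = (4, 1) and increasing D.
ratios = [param_counts(480189, 17770, D, 4, 1)[2] for D in (15, 40, 240)]
```

The ratio equals $1 + (K^U + K^M)/D$, so it shrinks toward 1 as $D$ grows while the topic counts stay fixed.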

Finally, because the M³F-TIB model offered superior performance to the M³F-TIF model in
most experiments, we focus on the M³F-TIB model in the remainder of this section.

17.5.2 10M MovieLens Dataset 

For the larger datasets, we initialized the Gibbs samplers with MAP estimates of $a$ and $b$ under
simple Gaussian priors, which we trained with stochastic gradient descent. This is similar to the
PMF initialization scheme of Salakhutdinov and Mnih (2008). All other parameters were initialized
to their model means.

For the 10M MovieLens dataset, we averaged our results across the $r_a$ and $r_b$ train-test splits
provided with the dataset after removing those test set ratings with no corresponding item in the
training set. For comparison with the Gaussian process matrix factorization model of Lawrence
and Urtasun (2009), we adopted a static factor dimensionality of $D = 10$. Our M³F-TIB model
with $(K^U, K^M) = (4, 1)$ achieved an RMSE of (0.8447 ± 0.0095), representing a significant
improvement ($p = 0.034$) over BPMF with RMSE (0.8472 ± 0.0093) and a substantial increase in
accuracy over the Gaussian process model with RMSE (0.8740 ± 0.0197).

17.5.3 Netflix Prize Dataset 

The unobserved ratings for the 100 million dyad Netflix Prize dataset are partitioned into two
standard sets, known as the Quiz Set and the Test Set. Prior to September 2009, public evaluation was
only available on the Quiz Set, and, as a result, most prior published "test set" results were
evaluated on the Quiz Set. In Table 17.2, we compare the performance of BPMF and M³F-TIB with
$(K^U, K^M) = (4, 1)$ on the Quiz Set, the Test Set, and on their union (the Qualifying Set), across a
wide range of static dimensionalities. We also report running times of our Matlab/MEX
implementation on dual quad-core 2.67GHz Intel Xeon CPUs. We used the initialization scheme
described in Section 17.5.2 and ran the Gibbs samplers for 500 iterations.

In addition to outperforming the BPMF baselines of comparable dimensionality, the M³F-TIB
models routinely proved to be more accurate than higher-dimensional BPMF models with longer
running times and many more learned parameters. This major advantage of M³F modeling is
highlighted in Figure 17.2, which plots error as a function of the number of parameters modeled per
user or item ($D + K^U + K^M$).

To determine for which users and movies our models were providing the most improvement over
BPMF, we divided the Qualifying Set into bins based on the number of ratings associated with each
user and movie in the database. Figure 17.3 displays the improvements of BPMF/60, M³F-TIB/40,
and M³F-TIB/60 over BPMF/40 as a function of the number of user or movie ratings. Consistent
with our expectations, we found that adopting an M³F model yielded improved accuracy for movies
of small rating counts, with the greatest improvement over BPMF occurring for those high-variance
movies with relatively few ratings. Moreover, the improvements realized by either M³F-TIB model
uniformly dominated the improvements realized by BPMF/60 across movie rating counts. At the
same time, we found that the improvements of the M³F-TIB models were skewed toward users with
larger rating counts.
TABLE 17.2

Netflix Prize results for BPMF and M³F-TIB with $(K^U, K^M) = (4, 1)$. Hidden ratings are
partitioned into Quiz and Test sets; the Qualifying set is their union. Best results in each block are
boldened. Reported times are average running times per sample.

Method       Test     Quiz     Qual     Time
BPMF/15      0.9125   0.9117   0.9121   27.8s
TIB/15       0.9093   0.9086   0.9090   46.3s
BPMF/30      0.9049   0.9044   0.9047   38.6s
TIB/30       0.9018   0.9012   0.9015   56.9s
BPMF/40      0.9029   0.9026   0.9027   48.3s
TIB/40       0.8992   0.8988   0.8990   70.5s
BPMF/60      0.9004   0.9001   0.9002   94.3s
TIB/60       0.8965   0.8960   0.8962   97.0s
BPMF/120     0.8958   0.8953   0.8956   273.7s
TIB/120      0.8937   0.8931   0.8934   285.2s
BPMF/240     0.8939   0.8936   0.8938   1152.0s
TIB/240      0.8931   0.8927   0.8929   1158.2s



FIGURE 17.2

RMSE performance of BPMF and M³F-TIB with $(K^U, K^M) = (4, 1)$ on the Netflix Prize
Qualifying set as a function of the number of parameters modeled per user or item.
M³F & The Napoleon Dynamite Effect

In our introduction to the M³F-TIB model we discussed the joint contextual bias as a potential
solution to the problem of making predictions for movies that have high variance. To investigate
whether or not M³F-TIB achieved progress towards this goal, we analyzed the correlation between
the improvement in RMSE over the BPMF baseline and the variance of ratings for the 1000 most
popular movies in the database. While the improvements for BPMF/60 were not significantly
correlated with movie variance ($\rho = -0.016$), the improvements of the M³F-TIB models were
significantly correlated, with $\rho = 0.117$ ($p < 0.001$) and $\rho = 0.15$ ($p < 10^{-7}$) for the (40, 4, 1) and
(60, 4, 1) models, respectively. These results indicate that a strength of the M³F-TIB model lies in
the ability of the topic-indexed biases to model variance in user biases toward specific items.
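An analysis of this kind can be sketched as below. We assume here that $\rho$ is a Pearson correlation, which the text does not state explicitly; the function name `improvement_variance_correlation` and the per-movie input arrays are hypothetical.

```python
import numpy as np

def improvement_variance_correlation(rmse_baseline, rmse_model, rating_variance):
    """Pearson correlation between per-movie RMSE improvement and rating variance."""
    improvement = np.asarray(rmse_baseline, float) - np.asarray(rmse_model, float)
    return float(np.corrcoef(improvement, rating_variance)[0, 1])
```

A positive value means that the model's gains over the baseline tend to be larger for movies whose ratings vary more.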

To further illuminate this property of the model, we computed the posterior expectation of the
movie bias parameters, $E(d_j^i \mid r^{(v)})$, for the 200 most popular movies in the database. For these
movies, the variance of $E(d_j^i \mid r^{(v)})$ across topics and the variance of the ratings of these movies
were very strongly correlated ($\rho = 0.682$, $p < 10^{-10}$). The five movies with the highest and lowest
variance in $E(d_j^i \mid r^{(v)})$ across topics are shown in Table 17.3. The results are easily interpretable,
with high-variance movies such as Napoleon Dynamite dominating the high-variance positions and
universally acclaimed blockbusters dominating the low-variance positions.


TABLE 17.3

Top 200 movies from the Netflix Prize dataset with the highest and lowest cross-topic variance
in $E(d_j^i \mid r^{(v)})$. Reported intervals are of the mean value of $E(d_j^i \mid r^{(v)})$, plus or minus one
standard deviation.

Movie Title                           $E(d_j^i \mid r^{(v)})$
Napoleon Dynamite                     -0.11 ± 0.93
Fahrenheit 9/11                       -0.06 ± 0.90
Chicago                               -0.12 ± 0.78
The Village                           -0.14 ± 0.71
Lost in Translation                   -0.02 ± 0.70
LotR: The Fellowship of the Ring       0.15 ± 0.00
LotR: The Two Towers                   0.18 ± 0.00
LotR: The Return of the King           0.24 ± 0.00
Star Wars: Episode V                   0.35 ± 0.00
Raiders of the Lost Ark                0.29 ± 0.00


17.6 Conclusion 

In this chapter, we developed a fully Bayesian dyadic data prediction framework for integrating the
complementary approaches of discrete mixed membership modeling and continuous latent factor
modeling. We introduced two mixed membership matrix factorization models, presented MCMC
inference procedures, and evaluated our methods on the EachMovie, MovieLens, and Netflix Prize
datasets. On each dataset, we found that M³F-TIB significantly outperformed BPMF and other
state-of-the-art baselines, even when fitting fewer parameters. We further discovered that the greatest
performance improvements occurred for the high-variance, sparsely-rated items, for which accurate
DDP is typically the hardest.

FIGURE 17.3

RMSE improvements over BPMF/40 on the Netflix Prize as a function of movie or user rating
count. Left: Improvement as a function of movie rating count. Each x-axis label represents the
average rating count of 1/6 of the movie base. Right: Improvement over BPMF as a function of user
rating count. Each bin represents 1/8 of the user base.


Appendix: Gibbs Sampling Conditionals for M³F Models

The M³F-TIB Model

In this section, we specify the conditional distributions used by the Gibbs sampler for the M³F-TIB
model.

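Normal-Wishart Parameters

$\Lambda^U \mid \mathrm{rest} \setminus \{\mu^U\} \sim \mathrm{Wishart}\Big( \Big( W_0^{-1} + \sum_{u=1}^{U} (a_u - \bar{a})(a_u - \bar{a})^{\top} + \frac{\lambda_0 U}{\lambda_0 + U} (\mu_0 - \bar{a})(\mu_0 - \bar{a})^{\top} \Big)^{-1}, \nu_0 + U \Big)$, where $\bar{a} = \frac{1}{U} \sum_{u=1}^{U} a_u$.

$\Lambda^M \mid \mathrm{rest} \setminus \{\mu^M\} \sim \mathrm{Wishart}\Big( \Big( W_0^{-1} + \sum_{j=1}^{M} (b_j - \bar{b})(b_j - \bar{b})^{\top} + \frac{\lambda_0 M}{\lambda_0 + M} (\mu_0 - \bar{b})(\mu_0 - \bar{b})^{\top} \Big)^{-1}, \nu_0 + M \Big)$, where $\bar{b} = \frac{1}{M} \sum_{j=1}^{M} b_j$.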
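Bias Parameters

For each $u$ and $i \in \{1, \ldots, K^M\}$,

$c_u^i \mid \mathrm{rest} \sim \mathcal{N}\left( \frac{\frac{c_0}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2} z_{uji}^M \big( r_{uj} - \chi_0 - d_j^{z_{uj}^U} - a_u \cdot b_j \big)}{\frac{1}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2} z_{uji}^M}, \; \frac{1}{\frac{1}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2} z_{uji}^M} \right)$.

For each $j$ and $i \in \{1, \ldots, K^U\}$,

$d_j^i \mid \mathrm{rest} \sim \mathcal{N}\left( \frac{\frac{d_0}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2} z_{uji}^U \big( r_{uj} - \chi_0 - c_u^{z_{uj}^M} - a_u \cdot b_j \big)}{\frac{1}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2} z_{uji}^U}, \; \frac{1}{\frac{1}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2} z_{uji}^U} \right)$.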
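The bias conditionals are standard Gaussian updates that combine the prior with the residuals of the ratings assigned to the topic in question. A minimal sketch for a single user bias $c_u^i$ follows; the function name `sample_user_bias` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def sample_user_bias(resid, z_is_i, c0, sigma0_sq, sigma_sq, rng):
    """Draw c_u^i from its Gaussian conditional; also return its mean and variance.

    resid[n]  : r_uj - chi_0 - d_j^{z^U_uj} - a_u . b_j for the n-th item in V_u
    z_is_i[n] : 1 if the item topic z^M_uj equals i for that rating, else 0
    """
    resid = np.asarray(resid, dtype=float)
    z = np.asarray(z_is_i, dtype=float)
    precision = 1.0 / sigma0_sq + z.sum() / sigma_sq
    mean = (c0 / sigma0_sq + (z * resid).sum() / sigma_sq) / precision
    variance = 1.0 / precision
    return rng.normal(mean, np.sqrt(variance)), mean, variance
```

Ratings assigned to other item topics have an indicator of zero and therefore leave this conditional unchanged.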


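Static Factors

For each $u$,

$a_u \mid \mathrm{rest} \sim \mathcal{N}\Big( (\Lambda_u^{U*})^{-1} \Big( \Lambda^U \mu^U + \sum_{j \in V_u} \frac{1}{\sigma^2} b_j \big( r_{uj} - \chi_0 - c_u^{z_{uj}^M} - d_j^{z_{uj}^U} \big) \Big), \; (\Lambda_u^{U*})^{-1} \Big)$,

where $\Lambda_u^{U*} = \Lambda^U + \sum_{j \in V_u} \frac{1}{\sigma^2} b_j (b_j)^{\top}$.

For each $j$,

$b_j \mid \mathrm{rest} \sim \mathcal{N}\Big( (\Lambda_j^{M*})^{-1} \Big( \Lambda^M \mu^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} a_u \big( r_{uj} - \chi_0 - c_u^{z_{uj}^M} - d_j^{z_{uj}^U} \big) \Big), \; (\Lambda_j^{M*})^{-1} \Big)$,

where $\Lambda_j^{M*} = \Lambda^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} a_u (a_u)^{\top}$.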
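The static factor conditional is a multivariate Gaussian whose precision adds the scaled outer products of the rated items' factors to the prior precision. A minimal sketch for one user follows; the function name `sample_static_user_factor` and the stacked-matrix layout are assumptions for illustration.

```python
import numpy as np

def sample_static_user_factor(B, resid, Lambda_U, mu_U, sigma_sq, rng):
    """Draw a_u from its Gaussian conditional; also return its mean and covariance.

    B[n]     : static factor b_j of the n-th item in V_u, shape (n_items, D)
    resid[n] : r_uj - chi_0 - c_u^{z^M_uj} - d_j^{z^U_uj} for that item
    """
    precision = Lambda_U + B.T @ B / sigma_sq      # Lambda_u^{U*}
    cov = np.linalg.inv(precision)
    mean = cov @ (Lambda_U @ mu_U + B.T @ resid / sigma_sq)
    return rng.multivariate_normal(mean, cov), mean, cov
```

The update for an item factor $b_j$ has the same shape, with the roles of users and items exchanged.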
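Dirichlet Parameters

For each $u$, $\theta_u^U \mid \mathrm{rest} \sim \mathrm{Dir}\big( \alpha / K^U + \sum_{j \in V_u} z_{uj}^U \big)$.

For each $j$, $\theta_j^M \mid \mathrm{rest} \sim \mathrm{Dir}\big( \alpha / K^M + \sum_{u : j \in V_u} z_{uj}^M \big)$.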
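The Dirichlet update simply adds topic assignment counts to the prior concentration. A minimal sketch for one user, with the hypothetical helper name `sample_theta_user` and topics encoded as integers in $\{0, \ldots, K^U - 1\}$:

```python
import numpy as np

def sample_theta_user(z_U, K_U, alpha, rng):
    """Draw theta^U_u given the user topics z^U_uj of that user's ratings."""
    counts = np.bincount(np.asarray(z_U, dtype=int), minlength=K_U)
    return rng.dirichlet(alpha / K_U + counts)
```

The item-side update is identical, using the item topic assignments and $K^M$.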
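Topic Variables

For each $u$ and $j \in V_u$, $z_{uj}^U \mid \mathrm{rest} \sim \mathrm{Multi}(1, \theta_{uj}^{U*})$, where

$\theta_{uji}^{U*} \propto \theta_{ui}^U \exp\left( - \frac{\big( r_{uj} - \chi_0 - c_u^{z_{uj}^M} - d_j^i - a_u \cdot b_j \big)^2}{2\sigma^2} \right)$.

For each $j$ and $u : j \in V_u$, $z_{uj}^M \mid \mathrm{rest} \sim \mathrm{Multi}(1, \theta_{uj}^{M*})$, where

$\theta_{uji}^{M*} \propto \theta_{ji}^M \exp\left( - \frac{\big( r_{uj} - \chi_0 - c_u^i - d_j^{z_{uj}^U} - a_u \cdot b_j \big)^2}{2\sigma^2} \right)$.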


The M³F-TIF Model

In this section, we specify the conditional distributions used by the Gibbs sampler for the M³F-TIF
model.
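Normal-Wishart Parameters

$\Lambda^U \mid \mathrm{rest} \setminus \{\mu^U\} \sim \mathrm{Wishart}\Big( \Big( W_0^{-1} + \sum_{u=1}^{U} (a_u - \bar{a})(a_u - \bar{a})^{\top} + \frac{\lambda_0 U}{\lambda_0 + U} (\mu_0 - \bar{a})(\mu_0 - \bar{a})^{\top} \Big)^{-1}, \nu_0 + U \Big)$, where $\bar{a} = \frac{1}{U} \sum_{u=1}^{U} a_u$.

$\Lambda^M \mid \mathrm{rest} \setminus \{\mu^M\} \sim \mathrm{Wishart}\Big( \Big( W_0^{-1} + \sum_{j=1}^{M} (b_j - \bar{b})(b_j - \bar{b})^{\top} + \frac{\lambda_0 M}{\lambda_0 + M} (\mu_0 - \bar{b})(\mu_0 - \bar{b})^{\top} \Big)^{-1}, \nu_0 + M \Big)$, where $\bar{b} = \frac{1}{M} \sum_{j=1}^{M} b_j$.

$\mu^U \mid \mathrm{rest} \sim \mathcal{N}\Big( \frac{\lambda_0 \mu_0 + \sum_{u=1}^{U} a_u}{\lambda_0 + U}, \; \big( \Lambda^U (\lambda_0 + U) \big)^{-1} \Big)$.

$\mu^M \mid \mathrm{rest} \sim \mathcal{N}\Big( \frac{\lambda_0 \mu_0 + \sum_{j=1}^{M} b_j}{\lambda_0 + M}, \; \big( \Lambda^M (\lambda_0 + M) \big)^{-1} \Big)$.

$\tilde{\Lambda}^U \mid \mathrm{rest} \setminus \{\tilde{\mu}^U\} \sim \mathrm{Wishart}\Big( \Big( \tilde{W}_0^{-1} + \sum_{u=1}^{U} \sum_{i=1}^{K^M} (c_u^i - \bar{c})(c_u^i - \bar{c})^{\top} + \frac{\tilde{\lambda}_0 U K^M}{\tilde{\lambda}_0 + U K^M} (\tilde{\mu}_0 - \bar{c})(\tilde{\mu}_0 - \bar{c})^{\top} \Big)^{-1}, \tilde{\nu}_0 + U K^M \Big)$, where $\bar{c} = \frac{1}{U K^M} \sum_{u=1}^{U} \sum_{i=1}^{K^M} c_u^i$.

$\tilde{\Lambda}^M \mid \mathrm{rest} \setminus \{\tilde{\mu}^M\} \sim \mathrm{Wishart}\Big( \Big( \tilde{W}_0^{-1} + \sum_{j=1}^{M} \sum_{i=1}^{K^U} (d_j^i - \bar{d})(d_j^i - \bar{d})^{\top} + \frac{\tilde{\lambda}_0 M K^U}{\tilde{\lambda}_0 + M K^U} (\tilde{\mu}_0 - \bar{d})(\tilde{\mu}_0 - \bar{d})^{\top} \Big)^{-1}, \tilde{\nu}_0 + M K^U \Big)$, where $\bar{d} = \frac{1}{M K^U} \sum_{j=1}^{M} \sum_{i=1}^{K^U} d_j^i$.

$\tilde{\mu}^U \mid \mathrm{rest} \sim \mathcal{N}\Big( \frac{\tilde{\lambda}_0 \tilde{\mu}_0 + \sum_{u=1}^{U} \sum_{i=1}^{K^M} c_u^i}{\tilde{\lambda}_0 + U K^M}, \; \big( \tilde{\Lambda}^U (\tilde{\lambda}_0 + U K^M) \big)^{-1} \Big)$.

$\tilde{\mu}^M \mid \mathrm{rest} \sim \mathcal{N}\Big( \frac{\tilde{\lambda}_0 \tilde{\mu}_0 + \sum_{j=1}^{M} \sum_{i=1}^{K^U} d_j^i}{\tilde{\lambda}_0 + M K^U}, \; \big( \tilde{\Lambda}^M (\tilde{\lambda}_0 + M K^U) \big)^{-1} \Big)$.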


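Bias Parameters

For each $u$,

$\xi_u \mid \mathrm{rest} \sim \mathcal{N}\left( \frac{\frac{\xi_0}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2} \big( r_{uj} - \chi_j - a_u \cdot b_j - c_u^{z_{uj}^M} \cdot d_j^{z_{uj}^U} \big)}{\frac{1}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \sum_{j \in V_u} \frac{1}{\sigma^2}} \right)$.

For each $j$,

$\chi_j \mid \mathrm{rest} \sim \mathcal{N}\left( \frac{\frac{\chi_0}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2} \big( r_{uj} - \xi_u - a_u \cdot b_j - c_u^{z_{uj}^M} \cdot d_j^{z_{uj}^U} \big)}{\frac{1}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \sum_{u : j \in V_u} \frac{1}{\sigma^2}} \right)$.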


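Static Factors

For each $u$,

$a_u \mid \mathrm{rest} \sim \mathcal{N}\Big( (\Lambda_u^{U*})^{-1} \Big( \Lambda^U \mu^U + \sum_{j \in V_u} \frac{1}{\sigma^2} b_j \big( r_{uj} - \xi_u - \chi_j - c_u^{z_{uj}^M} \cdot d_j^{z_{uj}^U} \big) \Big), \; (\Lambda_u^{U*})^{-1} \Big)$,

where $\Lambda_u^{U*} = \Lambda^U + \sum_{j \in V_u} \frac{1}{\sigma^2} b_j (b_j)^{\top}$.

For each $j$,

$b_j \mid \mathrm{rest} \sim \mathcal{N}\Big( (\Lambda_j^{M*})^{-1} \Big( \Lambda^M \mu^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} a_u \big( r_{uj} - \xi_u - \chi_j - c_u^{z_{uj}^M} \cdot d_j^{z_{uj}^U} \big) \Big), \; (\Lambda_j^{M*})^{-1} \Big)$,

where $\Lambda_j^{M*} = \Lambda^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} a_u (a_u)^{\top}$.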

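Topic-indexed Factors

For each $u$ and each $i \in 1, \ldots, K^M$,

$c_u^i \mid \mathrm{rest} \sim \mathcal{N}\Big( (\tilde{\Lambda}_{ui}^{U*})^{-1} \Big( \tilde{\Lambda}^U \tilde{\mu}^U + \sum_{j \in V_u} \frac{1}{\sigma^2} z_{uji}^M \, d_j^{z_{uj}^U} \big( r_{uj} - \xi_u - \chi_j - a_u \cdot b_j \big) \Big), \; (\tilde{\Lambda}_{ui}^{U*})^{-1} \Big)$,

where $\tilde{\Lambda}_{ui}^{U*} = \tilde{\Lambda}^U + \sum_{j \in V_u} \frac{1}{\sigma^2} z_{uji}^M \, d_j^{z_{uj}^U} \big( d_j^{z_{uj}^U} \big)^{\top}$.

For each $j$ and each $i \in 1, \ldots, K^U$,

$d_j^i \mid \mathrm{rest} \sim \mathcal{N}\Big( (\tilde{\Lambda}_{ji}^{M*})^{-1} \Big( \tilde{\Lambda}^M \tilde{\mu}^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} z_{uji}^U \, c_u^{z_{uj}^M} \big( r_{uj} - \xi_u - \chi_j - a_u \cdot b_j \big) \Big), \; (\tilde{\Lambda}_{ji}^{M*})^{-1} \Big)$,

where $\tilde{\Lambda}_{ji}^{M*} = \tilde{\Lambda}^M + \sum_{u : j \in V_u} \frac{1}{\sigma^2} z_{uji}^U \, c_u^{z_{uj}^M} \big( c_u^{z_{uj}^M} \big)^{\top}$.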

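Dirichlet Parameters

For each $u$, $\theta_u^U \mid \mathrm{rest} \sim \mathrm{Dir}\big( \alpha / K^U + \sum_{j \in V_u} z_{uj}^U \big)$.

For each $j$, $\theta_j^M \mid \mathrm{rest} \sim \mathrm{Dir}\big( \alpha / K^M + \sum_{u : j \in V_u} z_{uj}^M \big)$.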
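Topic Variables

For each $u$ and $j \in V_u$, $z_{uj}^U \mid \mathrm{rest} \sim \mathrm{Multi}(1, \theta_{uj}^{U*})$, where

$\theta_{uji}^{U*} \propto \theta_{ui}^U \exp\left( - \frac{\big( r_{uj} - \xi_u - \chi_j - a_u \cdot b_j - c_u^{z_{uj}^M} \cdot d_j^i \big)^2}{2\sigma^2} \right)$.

For each $j$ and $u : j \in V_u$, $z_{uj}^M \mid \mathrm{rest} \sim \mathrm{Multi}(1, \theta_{uj}^{M*})$, where

$\theta_{uji}^{M*} \propto \theta_{ji}^M \exp\left( - \frac{\big( r_{uj} - \xi_u - \chi_j - a_u \cdot b_j - c_u^i \cdot d_j^{z_{uj}^U} \big)^2}{2\sigma^2} \right)$.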
Mixed membership models have shown great promise in analyzing genetics, text documents, and
social network data. Unlike most existing likelihood-based approaches to learning mixed
membership models, we present a discriminative training method based on the maximum margin
principle that uses supervising side information, such as ratings or labels associated with documents,
to discover more predictive low-dimensional representations of the data. By using the linear
expectation operator, we can derive efficient variational methods for posterior inference and
parameter estimation. Empirical studies on the 20 Newsgroup dataset are provided. Our experimental
results demonstrate qualitatively and quantitatively that the max-margin-based mixed membership
model (a topic model in particular, for modeling text): 1) discovers sparse and highly discriminative
topical representations; 2) achieves state-of-the-art prediction performance; and 3) is more efficient
than existing supervised topic models.
18.1 Introduction

Mixed membership models are hierarchical extensions of finite mixture models where each
data point exhibits multiple components. They have been successfully applied to analyze genetics
(Pritchard et al., 2000), social networks (Airoldi et al., 2008), and text documents. For text
analysis, probabilistic latent aspect models such as latent Dirichlet allocation (LDA) (Blei et al., 2003)
have recently gained much popularity for stratifying a large collection of documents by projecting
every document into a low-dimensional space spanned by a set of bases that capture the semantic
aspects, also known as topics, of the collection. LDA posits that each document is an admixture
of latent topics, of which each topic is represented as a unigram distribution over a given
vocabulary. The document-specific admixture proportion vector $\theta$ is modeled as a latent Dirichlet
random variable, and can be regarded as a low-dimensional representation of the document in a
topical space. This low-dimensional representation can be used for downstream tasks such as
classification, clustering, or merely as a tool for structurally visualizing the otherwise unstructured
document collection.

LDA is typically built on a discrete bag-of-words representation of input contents, which can
be texts (Blei et al., 2003), images (Fei-Fei and Perona, 2005), or multi-type data (Blei and Jordan,
2003). However, in many practical applications, we can easily obtain useful side information besides
the document or image contents. For example, when online users post their reviews for products or
restaurants, they usually associate each review with a rating score or a thumbs-up/thumbs-down
opinion; web sites or pages in the public Yahoo! Directory 1 can have their categorical labels; and
images in the LabelMe (Russell et al., 2008) database are organized by a visual ontology and
additionally each image is associated with a set of annotation tags. Furthermore, there is an increasing
trend towards using online crowdsourcing services (such as Amazon Mechanical Turk 2) to collect
large collections of labeled data at a reasonably low price. Such side information often provides
useful high-level or direct summarization of the content, but it is not directly utilized in the original
LDA to influence topic inference. One would expect that incorporating such information into latent
aspect modeling could guide a topic model towards discovering secondary (or non-dominant) but
semantically more salient statistical patterns (Chechik and Tishby, 2002) that may be more
interesting or relevant to the user's goal, such as making predictions on unlabeled data.

To explore this potential, developing new topic models that appropriately capture the side
information mentioned above has recently gained increasing attention. Representative attempts include
the supervised topic model (sLDA) (Blei and McAuliffe, 2007), which captures real-valued document
ratings as a regression response; multi-class sLDA (Wang et al., 2009), which directly captures
discrete labels of documents as a classification response; and discriminative LDA (DiscLDA)
(Lacoste-Julien et al., 2008), which also performs classification, but with a mechanism different
from that of sLDA. All these models focus on the document-level side information such as document
categories or review rating scores to supervise model learning. More variants of supervised topic
models can be found in a number of applied domains, such as the aspect rating model (Titov and
McDonald, 2008) for predicting ratings for each aspect of a hotel. In computer vision, various
supervised topic models have been designed for understanding complex scene images (Sudderth et al.,
2005; Fei-Fei and Perona, 2005).

It is worth pointing out that among existing supervised topic models for incorporating side
information, there are two classes of approaches, namely, downstream supervised topic models (DSTM)
and upstream supervised topic models (USTM). In a DSTM, the response variable is predicted
based on the latent representation of the document, whereas in a USTM the response variable
is being conditioned to generate the latent representation of the document. Examples of USTM
include DiscLDA and the scene understanding models (Sudderth et al., 2005; Fei-Fei and Perona,
2005), whereas sLDA is an example of DSTM. Another distinction between existing supervised
topic models is the training criterion, or more precisely, the choice of objective function in the
optimization-based learning. The sLDA models are trained by maximizing the joint likelihood of
the content data (e.g., text or image) and the responses (e.g., labeling or rating), whereas DiscLDA
models are trained by maximizing the conditional likelihood of the responses given contents.

1 See http://dir.yahoo.com/.

2 See https://www.mturk.com/.

In this chapter, we present maximum entropy discrimination latent Dirichlet allocation 
(MedLDA), a supervised topic model leveraging the maximum margin principle for making more 
effective use of side information during estimation of latent topical representations. Unlike exist- 
ing supervised topic models mentioned above, MedLDA employs an arguably more discrimina- 
tive max-margin learning technique within a probabilistic framework; and unlike the commonly 
adopted two-stage heuristic which first estimates a latent topic vector for each document using a 
topic model and then feeds them to another downstream prediction model, MedLDA integrates the 
mechanism behind max-margin prediction models (e.g., SVMs) with the mechanism behind hier- 
archical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework. It 
employs a composite objective motivated by a tradeoff between two components — the negative log- 
likelihood of an underlying topic model which measures the goodness-of-fit for document contents, 
and a measure of prediction error on training data. It then seeks a regularized posterior distribution 
of the predictive function in a feasible space defined by a set of expected max-margin constraints 
generalized from the SVM-style margin constraints. Our proposed approach builds on earlier de- 
velopments in maximum entropy discrimination (MED) (Jaakkola et al., 1999; Jebara, 2001) and 
partially observed maximum entropy discrimination Markov network (PoMEN) (Zhu et al., 2008). 
In MedLDA, both the likelihood function over content data and the max-margin constraints induced
by the side information influence learning, so the discovery of latent topics is coupled with the
max-margin estimation of model parameters. This interplay can yield latent topical
representations that are more discriminative and more suitable for supervised prediction tasks, as
we demonstrate in the experimental section. We also present an efficient variational approach for
inference under MedLDA, with a running time comparable to that of an unsupervised LDA and lower
than that of other likelihood-based supervised LDAs. This advantage arises because MedLDA directly
optimizes a margin-based loss instead of a likelihood-based one, and thereby avoids the
normalization factor that results from a full probabilistic generative formulation, which
generally makes learning harder.

Finally, although we have focused on topic models, we emphasize that the methodology we 
develop is quite general and can be applied to perform max-margin learning for various mixed 
membership models, including the relational model (Airoldi et al., 2008). Moreover, the ideas can 
be extended to nonparametric Bayesian models (Zhu et al., 2011a; Zhu, 2012; Xu et al., 2012). 

The rest of this chapter is structured as follows. Section 18.2 introduces the preliminaries that 
are needed to present MedLDA. Section 18.3 presents the MedLDA model for classification, to- 
gether with an efficient algorithm. Section 18.4 presents empirical studies of MedLDA. Finally, 
Section 18.5 concludes this chapter and discusses future research directions.


18.2 Preliminaries 

We begin with a brief overview of the fundamentals of mixed membership models, support vector 
machines, and maximum entropy discrimination (Jaakkola et al., 1999), which constitute the major 
building blocks of the proposed MedLDA.

18.2.1 Hierarchical Bayesian Mixed Membership Models

A general formulation of mixed membership models was presented in Erosheva et al. (2004), which 
characterizes these models in terms of assumptions at four levels: population, subject, latent vari- 
able, and sampling scheme. Population level assumptions describe the general structure of the popu- 
lation that is common to all subjects. Subject level assumptions specify the distribution of observed 
responses given individual membership scores. Latent variable level assumptions are about whether 
the membership scores are fixed or random. Finally, the last level of assumptions specifies the number
of distinct observed characteristics (attributes) and the number of replications for each characteristic. 

(1) Population Level. Assume that there are $K$ components or basis subpopulations in the
populations of interest. For each subpopulation $k$, we denote by $f(x_{dn} \mid \beta_{kn})$ the
probability distribution of the $n$th response variable for the $d$th subject, where $\beta_k$ is
an $M$-dimensional vector of parameters. Within a subpopulation, the observed responses are
assumed to be independent across subjects and characteristics.

(2) Subject Level. For each subject $d$, a membership vector $\theta_d = (\theta_{d1}, \ldots,
\theta_{dK})$ represents the degrees of the subject’s membership to the various subpopulations.
The distribution of the observed response $x_{dn}$ for each subject given the membership scores
$\theta_d$ is then $p(x_{dn} \mid \theta_d) = \sum_k \theta_{dk} f(x_{dn} \mid \beta_{kn})$.
Conditional on the mixed membership scores, the response variables $x_{dn}$ are independent of
each other, and also independent across subjects.

(3) Latent Variable Level. With respect to the membership scores, one could assume they are either
fixed unknown constants or random realizations from some underlying distribution. For Bayesian
mixed membership models, which are our focus, the latter strategy is adopted, that is, assume that
$\theta_d$ are realizations of latent variables from some distribution $D_\alpha$, parameterized by
a vector $\alpha$. The probability of observing $x_{dn}$ is then $p(x_{dn} \mid \alpha, \beta) =
\int \big( \sum_k \theta_{dk} f(x_{dn} \mid \beta_{kn}) \big) D_\alpha(d\theta)$.

(4) Sampling Scheme Level. Suppose $R$ independent replications of $M$ distinct characteristics are
observed for the $d$th subject. The conditional probability of observing $\mathbf{x}_d =
\{x_{d1}^r, \ldots, x_{dM}^r\}_{r=1}^{R}$ given the parameters is then

$$ p(\mathbf{x}_d \mid \alpha, \beta) = \int \left( \prod_{n=1}^{M} \prod_{r=1}^{R} \sum_{k=1}^{K} \theta_{dk} f(x_{dn}^r \mid \beta_{kn}) \right) D_\alpha(d\theta). \tag{18.1} $$
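
To make Equation (18.1) concrete, the following minimal Python sketch estimates the marginal
probability of one subject's responses by Monte Carlo, averaging the integrand over draws of the
membership vector. It is an illustration under stated assumptions, not an estimation procedure from
this chapter: the response distribution f is taken to be categorical, the distribution over
membership vectors is taken to be Dirichlet, and the names `sample_dirichlet`,
`conditional_likelihood`, and `marginal_likelihood_mc` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def conditional_likelihood(x_d, theta, beta):
    """Integrand of Equation (18.1) for one subject.

    x_d[r][n] is the category observed for characteristic n in replication r, and
    beta[k][n][v] is the probability of category v under subpopulation k
    (a categorical f is an assumption made for this illustration).
    """
    prob = 1.0
    for replication in x_d:
        for n, value in enumerate(replication):
            prob *= sum(theta[k] * beta[k][n][value] for k in range(len(theta)))
    return prob


def marginal_likelihood_mc(x_d, alpha, beta, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(x_d | alpha, beta): average the integrand over theta draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += conditional_likelihood(x_d, sample_dirichlet(alpha, rng), beta)
    return total / num_samples
```

With two subpopulations, a single binary characteristic, and a uniform Dirichlet, the exact value
is the prior-mean mixture of the two category probabilities, which gives a simple check on the
estimate.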

Hierarchical Bayesian mixed membership models have been widely used in analyzing various 
forms of data, including discrete text documents (Blei et al., 2003), population genetics (Pritchard 
et al., 2000), social networks (Airoldi et al., 2008), and disability survey data (Erosheva, 2003). 
Below, we will study the mixed membership models for discrete text documents (i.e., topic models) 
as a test bed for mixed membership modeling ideas. But we emphasize that the methodology we 
will develop is applicable to a broad range of hierarchical Bayesian models. 

18.2.2 Hierarchical Bayesian Topic Models 

Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a Bayesian mixed membership model for
modeling discrete text documents. In LDA, the components or subpopulations are topics, each of
which is a multinomial distribution over the $M$ words in a given vocabulary, i.e., $\beta_k \in
\mathcal{P}$, where $\mathcal{P}$ is the space of probability distributions with an appropriate
dimension, which will be omitted when the context is clear; and the membership score $\theta_d$
for document $d$ is a mixing proportion vector over the $K$ topics. We denote the vector of words
appearing in document $d$ as $\mathbf{w}_d = (w_{d1}, \ldots, w_{dN_d})$. A word that appears
multiple times has multiple placeholders in $\mathbf{w}_d$. Thus, $\mathbf{w}_d$ can be seen as a
replication of appearing words. Let $\beta = [\beta_1; \ldots; \beta_K]$ denote the $K \times M$
matrix of topic parameters. Under LDA, the likelihood of a document corresponds to the following
generative process:

1. For document $d$, draw a topic mixing proportion vector $\theta_d$: $\theta_d \mid \alpha \sim
\mathrm{Dir}(\alpha)$;

2. For the $n$th word in document $d$, where $1 \le n \le N_d$,

(a) Draw a topic assignment $z_{dn}$ according to $\theta_d$: $z_{dn} \mid \theta_d \sim
\mathrm{Mult}(\theta_d)$;

(b) Draw the word $w_{dn}$ according to $z_{dn}$: $w_{dn} \mid z_{dn}, \beta \sim
\mathrm{Mult}(\beta_{z_{dn}})$,

where $z_{dn}$ is a $K$-dimensional indicator vector (i.e., only one element is 1; all others are
0), an instance of the topic assignment random variable $Z_{dn}$, and $\mathrm{Dir}(\alpha)$ is a
$K$-dimensional Dirichlet distribution, parameterized by $\alpha$. With a slight abuse of notation,
we have used $\beta_{z_{dn}}$ to denote the topic that is selected by the non-zero element of
$z_{dn}$.
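
The generative process above can be sketched in a few lines of Python. This is a minimal
illustration using only the standard library; the function names are hypothetical, and topic
assignments are stored as integer indices rather than as the $K$-dimensional indicator vectors
used in the text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, num_words, rng):
    """Sample (theta_d, z_d, w_d) following steps 1, 2(a), and 2(b).

    beta[k][w] is the probability of word w under topic k; words and topics
    are returned as integer indices.
    """
    num_topics, vocab_size = len(beta), len(beta[0])
    # Step 1: topic mixing proportions for the document.
    theta = sample_dirichlet(alpha, rng)
    # Step 2(a): one topic assignment per word position.
    topics = rng.choices(range(num_topics), weights=theta, k=num_words)
    # Step 2(b): each word is drawn from the topic selected for its position.
    words = [rng.choices(range(vocab_size), weights=beta[k], k=1)[0] for k in topics]
    return theta, topics, words
```

When the topics have disjoint supports, every sampled word reveals its topic assignment, which
gives an easy sanity check on the sampler.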

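Let $\mathbf{z}_d = \{z_{dn}\}_{n=1}^{N_d}$ denote the set of topic assignments for all the words
in document $d$. For a corpus $\mathcal{D}$ that contains $D$ documents, we let $\Theta =
\{\theta_d\}_{d=1}^{D}$, $\mathbf{Z} = \{\mathbf{z}_d\}_{d=1}^{D}$, and $\mathbf{W} =
\{\mathbf{w}_d\}_{d=1}^{D}$.

According to the above generative process, an unsupervised LDA defines the joint distribution

$$ p(\Theta, \mathbf{Z}, \mathbf{W} \mid \alpha, \beta) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right). \tag{18.2} $$

For LDA, the learning task is to estimate the unknown parameters $(\alpha, \beta)$. Maximum
likelihood estimation (MLE) is usually applied, which solves the problem

$$ \max_{\alpha, \beta} \; \log p(\mathbf{W} \mid \alpha, \beta), \quad \text{s.t.}: \; \beta_k \in \mathcal{P}. \tag{18.3} $$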


Once an LDA model is given (i.e., after learning), we can apply it to perform exploratory analysis
for discovering underlying patterns. This task is done by deriving the posterior distribution using
Bayes’ rule, that is,

$$ p(\Theta, \mathbf{Z} \mid \mathbf{W}, \alpha, \beta) = \frac{p(\Theta, \mathbf{Z}, \mathbf{W} \mid \alpha, \beta)}{p(\mathbf{W} \mid \alpha, \beta)}. \tag{18.4} $$

Computationally, however, the likelihood $p(\mathbf{W} \mid \alpha, \beta)$ is intractable to
compute exactly. Therefore, approximate inference algorithms based on variational (Blei et al.,
2003) or Markov chain Monte Carlo (MCMC) methods (Griffiths and Steyvers, 2004) have been widely
used for parameter estimation and posterior inference under LDA.

Note that we have restricted ourselves to treating $\beta$ as an unknown parameter, as done in Blei
and McAuliffe (2007) and Wang et al. (2009). Extension to a Bayesian treatment of $\beta$ (i.e., by
putting a prior over $\beta$ and inferring its posterior) can be easily done in LDA as shown in the
literature (Blei et al., 2003), where posterior inference is to find $p(\Theta, \mathbf{Z}, \beta
\mid \mathbf{W}, \alpha)$ by using Bayes’ rule. As we shall see, MedLDA can also be easily extended
to the full Bayesian setting under a general framework of regularized Bayesian inference.

The LDA described above does not utilize side information for learning topics and inferring
topic vectors $\theta$, which could limit its power for predictive tasks. To address this
limitation, supervised topic models (sLDA) (Blei and McAuliffe, 2007) introduce a response variable
$Y$ to LDA for each document, as shown in Figure 18.1. For regression, where $y \in \mathbb{R}$,
the generative process of sLDA is similar to LDA, but with an additional step: draw a response
variable $y \mid \mathbf{z}_d, \eta, \delta^2 \sim \mathcal{N}(\eta^\top \bar{\mathbf{z}}_d,
\delta^2)$ for each document $d$, where $\bar{\mathbf{z}}_d = \frac{1}{N_d} \sum_n z_{dn}$ is the
average topic assignment over all the words in document $d$; $\eta$ is the regression weight
vector; and $\delta^2$ is a noise variance parameter. Then, the joint distribution of sLDA is

$$ p(\Theta, \mathbf{Z}, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \eta, \delta^2) = p(\Theta, \mathbf{Z}, \mathbf{W} \mid \alpha, \beta)\, p(\mathbf{y} \mid \mathbf{Z}, \eta, \delta^2), \tag{18.5} $$

where $\mathbf{y} = \{y_d\}_{d=1}^{D}$ is the set of labels and $p(\mathbf{y} \mid \mathbf{Z}, \eta,
\delta^2) = \prod_d p(y_d \mid \eta^\top \bar{\mathbf{z}}_d, \delta^2)$ due to the model’s
conditional independence assumption. In this case, the likelihood is $p(\mathbf{y}, \mathbf{W} \mid
\alpha, \beta, \eta, \delta^2)$ and the task of posterior inference is to find the posterior
distribution $p(\Theta, \mathbf{Z} \mid \mathbf{W}, \mathbf{y}, \alpha, \beta, \eta, \delta^2)$ by
using Bayes’ rule. Again, due to the intractability of the likelihood, variational methods were
used to do approximate inference and MLE.

FIGURE 18.1

Graphical illustration of LDA (left) (Blei et al., 2003); and supervised LDA (right) (Blei and
McAuliffe, 2007).

By changing the likelihood model of $Y$, sLDA can deal with various types of responses, such as
discrete ones for classification (Wang et al., 2009) using the multi-class logistic regression

$$ p(y \mid \mathbf{z}_d, \eta) = \frac{\exp(\eta_y^\top \bar{\mathbf{z}}_d)}{\sum_{y'} \exp(\eta_{y'}^\top \bar{\mathbf{z}}_d)}, \tag{18.6} $$

where $\eta_y$ is the vector of parameters associated with class $y$. However, posterior inference
in an sLDA classification model can be more challenging than that in the sLDA regression model.
This is because the non-Gaussian probability distribution in Equation (18.6) is highly nonlinear in
$\eta$ and $\mathbf{z}$, and its normalization factor can make the topic assignments of different
words in the same document strongly coupled. If we perform fully Bayesian inference, the likelihood
is non-conjugate with the commonly used priors, e.g., a Gaussian prior over $\eta$, and this
imposes further challenges on posterior inference. Variational methods were successfully used to
approximate the normalization factor (Wang et al., 2009) in an EM algorithm, but they can be
computationally expensive as we shall demonstrate in the experimental section.³
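
As a small illustration of Equation (18.6), the sketch below computes the class probabilities from
the average topic assignment of a document. It assumes topic assignments are given as integer
indices and class weights as one list per class; the function names are hypothetical. The
normalization over all classes in the last step is the factor that couples the topic assignments
of different words.

```python
import math


def mean_topic_assignment(topic_assignments, num_topics):
    """Average topic assignment z_bar_d: the fraction of words assigned to each topic."""
    counts = [0.0] * num_topics
    for k in topic_assignments:
        counts[k] += 1.0
    return [c / len(topic_assignments) for c in counts]


def class_probabilities(eta, z_bar):
    """Multi-class logistic regression p(y | z_d, eta) for every class y.

    eta[y] is the weight vector of class y. The maximum score is subtracted
    before exponentiating for numerical stability; this does not change the result.
    """
    scores = [sum(w * z for w, z in zip(eta_y, z_bar)) for eta_y in eta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)  # normalization factor shared by all classes
    return [e / total for e in exps]
```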

DiscLDA (Lacoste-Julien et al., 2008) is another supervised topic model for classification. 
DiscLDA is an upstream model, and the unknown parameter is the transformation matrix used to 
generate the document latent representations conditioned on class labels. This transformation matrix 
is learned by maximizing the conditional marginal likelihood of the text given class labels. 

This progress notwithstanding, most current developments of supervised topic models have been
built on a likelihood-driven probabilistic inference paradigm. In contrast, the max-margin-based
techniques widely used in learning discriminative models (Vapnik, 1998; Taskar et al., 2003) have
rarely been exploited to learn supervised topic models. Our work in Zhu et al. (2012) presents
the first formulation of max-margin supervised topic models,⁴ followed by subsequent work on image
annotation (Yang et al., 2010), classification (Wang and Mori, 2011), and entity relationship
extraction (Li et al., 2011). In this chapter, we present a novel formulation of MedLDA under the
general framework of regularized Bayesian inference. Below, we briefly review the max-margin
principle using the example of support vector machines.

3 For fully Bayesian sLDA, a Gibbs sampling algorithm was developed in Zhu et al. (2013) by exploring data augmentation
techniques.

4 A preliminary version was first published in 2009 (Zhu et al., 2009).

18.2.3 Support Vector Machines 

Depending on the nature of the response variable, the max-margin principle can be exploited in both 
classification and regression. Below we use document classification as an example to recapitulate 
the ideas behind SVMs, which we will shortly leverage to build our max-margin topic models. 

Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_D, y_D)\}$ be a training set, where
$\mathbf{x} \in \mathcal{X}$ are inputs such as document-feature vectors, and $y$ are categorical
response values taking values from a finite set $\mathcal{Y} = \{1, \ldots, L\}$. We consider the
general multi-class classification where $L$ is greater than 2. The goal of SVMs is to find a
discriminant function $h(y, \mathbf{x}; \eta) \in \mathcal{F}$ that could make accurate predictions
with the argmax rule $\hat{y} = \arg\max_y h(y, \mathbf{x}; \eta)$. One common choice of the
function family $\mathcal{F}$ is linear functions, that is, $h(y, \mathbf{x}; \eta) = \eta_y^\top
\mathbf{f}(\mathbf{x})$, where $\mathbf{f} = (f_1, \ldots, f_I)^\top$ is a vector of feature
functions $f_i : \mathcal{X} \to \mathbb{R}$, and $\eta_y$ is the corresponding weight vector
associated with class $y$. Formally, the linear SVM finds an optimal linear function by solving the
following constrained optimization problem (Crammer and Singer, 2001):⁵

$$ \min_{\eta, \xi} \; \frac{1}{2} \|\eta\|_2^2 + C \sum_{d=1}^{D} \xi_d \tag{18.7} $$

$$ \text{s.t.}: \; h(y_d, \mathbf{x}_d; \eta) - h(y, \mathbf{x}_d; \eta) \ge \ell_d(y) - \xi_d, \; \forall d, \forall y, $$

where $\eta = [\eta_1^\top, \ldots, \eta_L^\top]^\top$ is the concatenation of all subvectors;
$\xi$ are non-negative slack variables that tolerate some errors in the training data; $C$ is a
positive regularization constant; and $\ell_d(y)$ is a non-negative function that measures the cost
of predicting $y$ if the ground truth is $y_d$. It is typically assumed that $\ell_d(y_d) = 0$,
i.e., no cost for correct predictions. The quadratic programming (QP) problem can be solved in a
Lagrangian dual formulation. Samples with non-zero Lagrange multipliers are called support vectors.
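
The constraints in (18.7) can be read as defining, for each training example, the smallest slack
that makes all of them hold. The following sketch evaluates that slack and the resulting objective
for a fixed weight vector. It is an illustration only, under simplifying assumptions: the feature
map is the identity ($\mathbf{f}(\mathbf{x}) = \mathbf{x}$), the cost $\ell_d(y)$ is the 0/1 cost,
classes are indexed from 0, and the function names are hypothetical. It does not solve the QP.

```python
def discriminant(eta, x, y):
    """Linear discriminant h(y, x; eta) = eta_y^T f(x), taking f(x) = x here."""
    return sum(w * v for w, v in zip(eta[y], x))


def predict(eta, x):
    """Argmax prediction rule over the classes 0, ..., L-1."""
    return max(range(len(eta)), key=lambda y: discriminant(eta, x, y))


def slack(eta, x, y_true):
    """Smallest xi_d satisfying every constraint in (18.7) for one example (0/1 cost).

    The term for y = y_true is zero, so the returned slack is never negative.
    """
    h_true = discriminant(eta, x, y_true)
    return max((0.0 if y == y_true else 1.0) - (h_true - discriminant(eta, x, y))
               for y in range(len(eta)))


def objective(eta, data, C):
    """Objective of (18.7) with each slack set to its smallest feasible value."""
    regularizer = 0.5 * sum(w * w for eta_y in eta for w in eta_y)
    return regularizer + C * sum(slack(eta, x, y) for x, y in data)
```

An example classified correctly with a margin of at least one contributes zero slack; an example
inside the margin or misclassified contributes a positive slack.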

18.2.4 Maximum Entropy Discrimination 

The standard SVM formulation does not consider uncertainties of unknown variables, and it is thus
difficult to see how to incorporate the max-margin principle into Bayesian mixed membership
models, or topic models in particular. One significant step towards uniting the principles
behind Bayesian generative modeling and max-margin learning is the maximum entropy discrimination
(MED) formalism (Jebara, 2001), which learns a distribution over all possible classification
models that belong to a particular parametric family, subject to a set of margin-based constraints.
For instance, the MED classification model learns a distribution $q(\eta)$ through solving the
following optimization problem:

$$ \min_{q(\eta) \in \mathcal{P}, \xi} \; \mathrm{KL}(q(\eta) \| p_0(\eta)) + C \sum_{d=1}^{D} \xi_d \tag{18.8} $$

$$ \text{s.t.}: \; E_q[h(y_d, \mathbf{x}_d; \eta)] - E_q[h(y, \mathbf{x}_d; \eta)] \ge \ell_d(y) - \xi_d, \; \forall d, \forall y, $$

where $p_0(\eta)$ is a prior distribution over the parameters, and $\mathrm{KL}(p \| q) =
E_p[\log(p/q)]$ is the Kullback-Leibler (KL) divergence.

5 The formulation implies that $\xi_d \ge 0$, since all possible predictions including $y_d$ are included in the constraints.

As studied in Jebara (2001), this MED problem leads to an entropic-regularized posterior
distribution of the SVM coefficients, $q(\eta)$; and the resultant predictor $\hat{y} = \arg\max_y
E_{q(\eta)}[h(y, \mathbf{x}; \eta)]$ enjoys several nice properties and subsumes the standard SVM
as a special case when the prior $p_0(\eta)$ is standard normal. Moreover, as shown in Zhu and Xing
(2009) and Zhu et al. (2011b), with different choices of the prior over $\eta$, such as a
sparsity-inducing Laplace or a nonparametric Dirichlet process, the resultant $q(\eta)$ can exhibit
a wide variety of characteristics and is suitable for diverse utilities such as feature selection
or learning complex non-linear discriminating functions. Finally, the recent developments of the
maximum entropy discrimination Markov network (MaxEnDNet) (Zhu and Xing, 2009) and partially
observed MaxEnDNet (PoMEN) (Zhu et al., 2008) have extended the basic MED to the much broader
scenarios of learning structured prediction functions with or without latent variables.
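
The connection to the standard SVM can be checked with a short calculation. The sketch below
assumes that $q(\eta)$ is Gaussian with identity covariance, $q(\eta) = \mathcal{N}(\lambda, I)$
(the form that the chapter later derives for a standard normal prior), and that the discriminant is
linear, $h(y, \mathbf{x}; \eta) = \eta_y^\top \mathbf{f}(\mathbf{x})$; the symbol $\lambda$ for the
mean is borrowed from Section 18.3.4.

```latex
\begin{align*}
\mathrm{KL}\big(\mathcal{N}(\lambda, I) \,\|\, \mathcal{N}(0, I)\big)
  &= E_q\Big[-\tfrac{1}{2}\|\eta - \lambda\|_2^2 + \tfrac{1}{2}\|\eta\|_2^2\Big]
   = E_q\big[\lambda^\top \eta\big] - \tfrac{1}{2}\|\lambda\|_2^2
   = \tfrac{1}{2}\|\lambda\|_2^2, \\
E_q\big[h(y, \mathbf{x}; \eta)\big]
  &= E_q[\eta_y]^\top \mathbf{f}(\mathbf{x})
   = \lambda_y^\top \mathbf{f}(\mathbf{x}).
\end{align*}
```

Substituting both identities into the MED problem turns the KL term into the squared-norm
regularizer and the expected margin constraints into ordinary margin constraints on $\lambda$,
which is the multi-class SVM problem with $\eta$ replaced by the posterior mean.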

In applying the MED idea to learn a supervised topic model, a major difficulty is the presence
of heterogeneous latent variables in the topic models, such as the topic vector $\theta$ and topic
indicator $\mathbf{Z}$. In the sequel, we present a novel formalism called maximum entropy
discrimination LDA (MedLDA) that extends the basic MED to make this possible, and at the same time
discovers latent discriminating topics present in the study corpus based on available discriminant
side information.


18.3 MedLDA: Max-Margin Supervised Topic Models 

Now we present a new class of supervised topic models that explicitly employ labeling information
in the context of document classification.⁶ To make our methodology general, we formalize
MedLDA under the framework of regularized Bayesian inference (Zhu et al., 2011a), which can
in principle be applied to any Bayesian mixed membership model by adding posterior constraints
that account for the supervising side information.

18.3.1 Bayesian Inference as a Learning Model 

As shown in Equation (18.4), Bayesian inference can be seen as an information processing rule that
projects the prior $p_0$ and empirical data to a posterior distribution via Bayes’ rule. Under this
classic interpretation, a natural way to consider supervising information is to extend the
likelihood model to incorporate it, as adopted in sLDA models.

6 For regression, MedLDA can be developed as in Zhu et al. (2009).

A fresh interpretation of Bayesian inference was given by Zellner (1988), which provides a
novel and more natural interpretation of MedLDA, as we shall see. Specifically, the posterior
distribution by Bayes’ rule is in fact the solution of an optimization problem. For instance, the
posterior $p(\Theta, \mathbf{Z} \mid \mathbf{W}, \alpha, \beta)$ of LDA is equivalent to the
optimum solution of

$$ \min_{q(\Theta, \mathbf{Z}) \in \mathcal{P}} \; \mathrm{KL}(q(\Theta, \mathbf{Z}) \| p_0(\Theta, \mathbf{Z} \mid \alpha, \beta)) - E_q[\log p(\mathbf{W} \mid \Theta, \mathbf{Z}, \beta)]. \tag{18.9} $$

We will use $\mathcal{L}_0(q(\Theta, \mathbf{Z}), \alpha, \beta)$ to denote the objective function.
In fact, we can show that the optimum objective value is the negative log-likelihood $-\log
p(\mathbf{W} \mid \alpha, \beta)$. Therefore, the MLE problem can be equivalently written in the
variational form

$$ \min_{\alpha, \beta} \Big( \min_{q(\Theta, \mathbf{Z}) \in \mathcal{P}} \mathcal{L}_0(q(\Theta, \mathbf{Z}), \alpha, \beta) \Big) = \min_{\alpha, \beta, q(\Theta, \mathbf{Z}) \in \mathcal{P}} \mathcal{L}_0(q(\Theta, \mathbf{Z}), \alpha, \beta), \tag{18.10} $$

which is the same as the objective of the EM algorithm (Blei et al., 2003) if no mean field
assumptions are made. For the case where $\beta$ is random, we have the same equality as above but
with $\beta$ moved from the set of unknown parameters into the distributions. For the fully
Bayesian models (either treating $\alpha$ as random too or leaving it pre-specified), we can solve
an optimization problem similar to the one above to infer the posterior distribution.
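
The claim that the optimum of (18.9) is the Bayes posterior, with optimum value equal to the
negative log-likelihood, follows from a short rearrangement. The derivation below is a sketch that
uses only Bayes’ rule and the definition of the KL divergence; the abbreviation
$\mathcal{L}_0(q, \alpha, \beta)$ for the objective is used for brevity.

```latex
\begin{align*}
\mathcal{L}_0(q, \alpha, \beta)
  &= E_q\!\left[\log \frac{q(\Theta, \mathbf{Z})}
       {p_0(\Theta, \mathbf{Z} \mid \alpha, \beta)\,
        p(\mathbf{W} \mid \Theta, \mathbf{Z}, \beta)}\right] \\
  &= E_q\!\left[\log \frac{q(\Theta, \mathbf{Z})}
       {p(\Theta, \mathbf{Z} \mid \mathbf{W}, \alpha, \beta)\,
        p(\mathbf{W} \mid \alpha, \beta)}\right] \\
  &= \mathrm{KL}\big(q(\Theta, \mathbf{Z}) \,\|\,
       p(\Theta, \mathbf{Z} \mid \mathbf{W}, \alpha, \beta)\big)
     - \log p(\mathbf{W} \mid \alpha, \beta).
\end{align*}
```

Since the KL divergence is non-negative and vanishes only when its two arguments coincide, the
minimum over $q$ is attained at the posterior, and the value at the minimum is $-\log p(\mathbf{W}
\mid \alpha, \beta)$.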

18.3.2 Regularized Bayesian Inference 

For the standard Bayesian inference, the posterior distribution is determined by a prior
distribution and a likelihood model through Bayes’ rule. The prior and the likelihood model
influence the behavior of the posterior distribution only indirectly. However, under the above
optimization formulation of Bayes’ rule, we have an additional channel for bringing in side
information to directly regularize the properties of the desired posterior distributions. Let
$\mathcal{M}$ be a model containing all the variables (e.g., $\Theta$ and $\mathbf{Z}$ for LDA)
whose posterior distributions we are trying to infer. Let $\mathcal{D}$ be the data (e.g.,
$\mathbf{W}$) whose likelihood model is defined, and let $\tau$ be hyperparameters. One formal
implementation of this idea is the regularized Bayesian inference as introduced in Zhu et al.
(2011a), which solves the constrained optimization problem

$$ \min_{q(\mathcal{M}), \xi} \; \mathrm{KL}(q(\mathcal{M}) \| p_0(\mathcal{M} \mid \tau)) - E_q[\log p(\mathcal{D} \mid \mathcal{M}, \tau)] + U(\xi) \tag{18.11} $$

$$ \text{s.t.}: \; q(\mathcal{M}) \in \mathcal{P}_{\text{post}}(\xi), $$

where $\mathcal{P}_{\text{post}}(\xi)$ is a subspace of distributions that satisfy a set of
constraints. We assume $\mathcal{P}_{\text{post}}(\xi)$ is non-empty for all $\xi$. The auxiliary
parameters $\xi$ are usually nonnegative and interpreted as slack variables. $U(\xi)$ is a convex
function, which usually corresponds to a surrogate loss (e.g., hinge loss) of a prediction rule, as
we shall see. Under the above formulation, Zhu et al. (2011a) presented the infinite latent SVM
models for classification and multi-task learning. Below, we present MedLDA as another
instantiation of regularized Bayesian models.

18.3.3 MedLDA: A Regularized Bayesian Model 

Let $\mathcal{D} = \{(\mathbf{w}_d, y_d)\}_{d=1}^{D}$ be a given fully labeled training set, where
the response variable $Y$ takes values from the finite set $\mathcal{Y}$. MedLDA consists of two
parts. The first part is an LDA likelihood model for describing input documents. We choose to use
an unsupervised LDA, which defines a likelihood model for $\mathbf{W}$. The second part is a
mechanism to consider the supervising signal. Since our goal is to discover latent representations
$\mathbf{Z}$ that are good for classification, one natural solution is to connect $\mathbf{Z}$
directly to our ultimate goal. MedLDA achieves this by building a classification model on
$\mathbf{Z}$. One good candidate for the classification model is the max-margin method, which
avoids defining a normalized likelihood model.

Formally, let $\eta$ denote the parameters of the classification model. As in MED, we treat $\eta$
as random variables and want to infer the joint posterior distribution $q(\eta, \Theta, \mathbf{Z}
\mid \mathcal{D}, \alpha, \beta)$, or $q(\eta, \Theta, \mathbf{Z})$ for short. The classification
model is defined as follows. If the latent topic representation $\mathbf{z}$ is given, MedLDA
defines the linear discriminant function as

$$ F(y, \eta, \mathbf{z}; \mathbf{w}) = \eta^\top \mathbf{f}(y, \bar{\mathbf{z}}), \tag{18.12} $$

where $\mathbf{f}(y, \bar{\mathbf{z}})$ is an $LK$-dimensional vector whose elements from $(y -
1)K + 1$ to $yK$ are $\bar{\mathbf{z}}$ and all others are zero; and $\eta$ is an $LK$-dimensional
vector concatenating $L$ class-specific sub-vectors. In order to predict on input data, MedLDA
defines the effective discriminant function using the expectation operator

$$ F(y; \mathbf{w}) = E_{q(\eta, \mathbf{z})}[F(y, \eta, \mathbf{z}; \mathbf{w})], \tag{18.13} $$

which is a linear functional of $q$.

With the above definitions, a natural prediction rule for a given posterior distribution $q$ is

$$ \hat{y} = \arg\max_{y \in \mathcal{Y}} F(y; \mathbf{w}). \tag{18.14} $$
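
The feature map and the prediction rule can be illustrated with a short sketch. It assumes that
$q$ factorizes over $\eta$ and $\mathbf{z}$, so that the expectation in (18.13) reduces to the
inner product of the two means; classes are indexed from 0 rather than 1, and the function names
are hypothetical.

```python
def feature_vector(y, z_bar, num_classes):
    """f(y, z_bar): an LK-dimensional vector holding z_bar in the block of class y."""
    num_topics = len(z_bar)
    f = [0.0] * (num_classes * num_topics)
    # With 0-based classes, block y occupies positions y*K, ..., (y+1)*K - 1.
    f[y * num_topics:(y + 1) * num_topics] = z_bar
    return f


def effective_discriminant(y, eta_mean, z_bar_mean, num_classes):
    """F(y; w) = E[eta]^T f(y, E[z_bar]), valid when q factorizes over eta and z."""
    f = feature_vector(y, z_bar_mean, num_classes)
    return sum(e * v for e, v in zip(eta_mean, f))


def predict(eta_mean, z_bar_mean, num_classes):
    """Prediction rule (18.14): the class with the largest effective discriminant."""
    return max(range(num_classes),
               key=lambda y: effective_discriminant(y, eta_mean, z_bar_mean, num_classes))
```

Because only the block of class $y$ is non-zero, the inner product picks out the class-specific
sub-vector of the weights, so the discriminant is the class-$y$ weights applied to the average
topic assignment.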





Then, we would like to “regularize” the properties of the latent topic representations to make them
suitable for a classification task. Here, we adopt the framework of regularized Bayesian inference
and impose the following max-margin constraints on the posterior distributions:

$$ F(y_d; \mathbf{w}_d) - F(y; \mathbf{w}_d) \ge \ell_d(y), \; \forall y \in \mathcal{Y}, \forall d. \tag{18.15} $$

That is, we want to find a “posterior distribution” that can predict correctly on all the training
data using the prediction rule (18.14). However, in many cases, these hard constraints would be too
strict. In order to learn a robust classifier for datasets that are not separable, a natural
generalization is to impose the soft max-margin constraints

$$ F(y_d; \mathbf{w}_d) - F(y; \mathbf{w}_d) \ge \ell_d(y) - \xi_d, \; \forall y \in \mathcal{Y}, \forall d, \tag{18.16} $$

where $\xi = \{\xi_d\}$ are non-negative slack variables. Let

$$ \mathcal{L}_1(q(\eta, \Theta, \mathbf{Z}), \alpha, \beta) = \mathrm{KL}(q(\eta, \Theta, \mathbf{Z}) \| p_0(\eta, \Theta, \mathbf{Z} \mid \alpha, \beta)) - E_q[\log p(\mathbf{W} \mid \mathbf{Z}, \beta)]. $$

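We define the soft-margin MedLDA model as solving

$$ \min_{q(\eta, \Theta, \mathbf{Z}) \in \mathcal{P}, \alpha, \beta, \xi} \; \mathcal{L}_1(q(\eta, \Theta, \mathbf{Z}), \alpha, \beta) + \frac{C}{D} \sum_{d=1}^{D} \xi_d \tag{18.17} $$

$$ \text{s.t.}: \; E_q[\eta^\top \Delta \mathbf{f}(y, \bar{\mathbf{z}}_d)] \ge \ell_d(y) - \xi_d, \; \xi_d \ge 0, \; \forall d, \forall y, $$

where the prior is $p_0(\eta, \Theta, \mathbf{Z} \mid \alpha, \beta) = p_0(\eta)\, p_0(\Theta,
\mathbf{Z} \mid \alpha, \beta)$, and $\Delta \mathbf{f}(y, \bar{\mathbf{z}}_d) = \mathbf{f}(y_d,
\bar{\mathbf{z}}_d) - \mathbf{f}(y, \bar{\mathbf{z}}_d)$. By removing slack variables, problem
(18.17) can be equivalently written as

$$ \min_{q(\eta, \Theta, \mathbf{Z}) \in \mathcal{P}, \alpha, \beta} \; \mathcal{L}_1(q(\eta, \Theta, \mathbf{Z}), \alpha, \beta) + C \mathcal{R}(q(\eta, \Theta, \mathbf{Z})), \tag{18.18} $$

where

$$ \mathcal{R}(q) = \frac{1}{D} \sum_{d} \max_{y} \big( \ell_d(y) - E_q[\eta^\top \Delta \mathbf{f}(y, \bar{\mathbf{z}}_d)] \big) $$

is the hinge loss, an upper bound of the prediction error on training data.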
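
As an illustration of the hinge loss term, the sketch below evaluates it for given posterior
means. Several choices here are assumptions rather than statements from the chapter: the loss is
averaged over documents (a $1/D$ normalization, chosen to match the $C/D$ weighting used in the
optimization algorithm of Section 18.3.4), $q$ is assumed to factorize over $\eta$ and
$\mathbf{z}$, the cost $\ell_d(y)$ is the 0/1 cost, classes are indexed from 0, and the function
names are hypothetical.

```python
def expected_margin(eta_mean, z_bar_mean, y_true, y):
    """E_q[eta^T Delta f(y, z_bar_d)] = (E[eta_{y_d}] - E[eta_y])^T E[z_bar_d].

    eta_mean[y] is the mean weight vector of class y; the identity holds
    when q factorizes over eta and z.
    """
    return sum((a - b) * z
               for a, b, z in zip(eta_mean[y_true], eta_mean[y], z_bar_mean))


def hinge_loss(eta_mean, docs):
    """Average over documents of max_y (l_d(y) - expected margin), with 0/1 cost l_d.

    docs is a list of (z_bar_mean, y_true) pairs. The term for y = y_true is
    zero, so each document contributes a non-negative amount.
    """
    total = 0.0
    for z_bar_mean, y_true in docs:
        total += max((0.0 if y == y_true else 1.0)
                     - expected_margin(eta_mean, z_bar_mean, y_true, y)
                     for y in range(len(eta_mean)))
    return total / len(docs)
```

A document whose expected margin against every wrong class is at least one contributes nothing; a
misclassified document contributes more than one, which is why the quantity upper-bounds the
training error under the 0/1 cost.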

Based on the equality in Equation (18.10), we can see the rationale underlying MedLDA: we want to
find latent topical representations $q(\Theta, \mathbf{Z})$ and a model parameter distribution
$q(\eta)$ that, on the one hand, predict as accurately as possible on the training data and, on the
other hand, explain the data well. The two parts are closely coupled by the expected margin
constraints.

Although in theory we can use either sLDA (Wang et al., 2009) or LDA as a building block
of MedLDA to discover latent topical representations, as we have discussed in Section 18.2.2,
inference under sLDA could be harder and slower because the probability model of discrete $Y$ in
Equation (18.6) is nonlinear over $\eta$ and $\mathbf{Z}$, both of which are latent variables in our case, and its
normalization factor strongly couples the topic assignments of different words in the same docu- 
ment. Therefore, we choose to use LDA that only models the likelihood of document contents W 
but not document label Y as the underlying topic model to discover latent representations Z. Even 
with this likelihood model, document labels can still influence topic learning and inference because 
they induce margin constraints pertinent to the topical distributions. As we shall see, the resul- 
tant MedLDA classification model can be efficiently learned by utilizing existing high-performance 
SVM solvers. Moreover, since the goal of max-margin learning is to directly minimize a hinge loss 
(i.e., an upper bound of the empirical loss), we do not need a normalized distribution model for 
response variables Y. 

Note that we have taken a full expectation to define $F(y; \mathbf{w})$ instead of taking the mode
as done in latent SVMs (Felzenszwalb et al., 2010; Yu and Joachims, 2009), because expectation is a
nice linear functional of the distributions under which it is taken, whereas taking the mode
involves the highly nonlinear argmax function for discrete $\mathbf{Z}$, which could lead to a
harder inference task. Furthermore, for the same reason of avoiding a highly nonlinear
discriminant function, we did not adopt the method in Jebara (2001) either, which uses
log-likelihood ratios to define the discriminant function when considering latent variables in MED.
Specifically, in our case, the max-margin constraints would be

$$ \forall d, \forall y: \; \log \frac{p(y_d \mid \mathbf{w}_d, \alpha, \beta)}{p(y \mid \mathbf{w}_d, \alpha, \beta)} \ge \ell_d(y) - \xi_d, \tag{18.19} $$

which are highly nonlinear due to the complex form of the marginal likelihood $p(y \mid
\mathbf{w}_d, \alpha, \beta)$. Our linear expectation operator is an effective tool to deal with
latent variables in the context of maximum margin learning. In fact, besides the present work, we
have successfully applied this operator to other challenging settings of learning latent variable
structured prediction models with nontrivial dependence structures among output variables (Zhu et
al., 2008) and learning nonparametric Bayesian models (Zhu et al., 2011b;a).


18.3.4 Optimization Algorithm for MedLDA 

Although we have used the simple linear expectation operator to define max-margin constraints,
the problem of MedLDA is still intractable to solve directly due to the intractability of
$\mathcal{L}_1$. Below, we present a coordinate descent algorithm with a further constraint on the
feasible distribution $q(\eta, \Theta, \mathbf{Z})$. Specifically, we impose the fully factorized
mean field constraint that

$$ q(\eta, \Theta, \mathbf{Z}) = q(\eta) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn}), \tag{18.20} $$

where $\gamma_d$ is a $K$-dimensional vector of Dirichlet parameters and each $\phi_{dn}$
parameterizes a multinomial distribution over $K$ topics. With this constraint, we have

$$ F(y; \mathbf{w}_d) = E_q[\eta]^\top \mathbf{f}(y, \bar{\phi}_d), $$

where $\bar{\phi}_d = E_q[\bar{\mathbf{z}}_d] = \frac{1}{N} \sum_n \phi_{dn}$, and the objective can
be effectively evaluated since

$$ \mathcal{L}_1(q(\eta, \Theta, \mathbf{Z}), \alpha, \beta) = \mathrm{KL}(q(\eta) \| p_0(\eta)) + \mathcal{L}_0(q(\Theta, \mathbf{Z}), \alpha, \beta), \tag{18.21} $$

where $\mathcal{L}_0$ can be computed as in Blei et al. (2003). By considering the unconstrained
formulation (18.18), our algorithm alternates between the following steps:

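1. Solve for $q(\eta)$: When $q(\Theta, \mathbf{Z})$ and $(\alpha, \beta)$ are fixed, the
subproblem (in an equivalent constrained form) is to solve

$$ \min_{q(\eta) \in \mathcal{P}, \xi} \; \mathrm{KL}(q(\eta) \| p_0(\eta)) + \frac{C}{D} \sum_{d=1}^{D} \xi_d \tag{18.22} $$

$$ \text{s.t.}: \; E_q[\eta]^\top \Delta \mathbf{f}(y, \bar{\phi}_d) \ge \ell_d(y) - \xi_d, \; \forall d, \forall y. $$

By using Lagrangian methods, we have the optimum solution

$$ q(\eta) = \frac{1}{\Psi}\, p_0(\eta) \exp\Big( \eta^\top \sum_{d} \sum_{y} \mu_d(y) \Delta \mathbf{f}(y, \bar{\phi}_d) \Big), \tag{18.23} $$

where the Lagrange multipliers $\mu$ are the solution of the dual problem:

$$ \max_{\mu} \; -\log \Psi + \sum_{d} \sum_{y} \mu_d(y) \ell_d(y) \tag{18.24} $$

$$ \text{s.t.}: \; \sum_{y} \mu_d(y) \in \Big[0, \frac{C}{D}\Big], \; \forall d. $$

We can choose different priors in MedLDA for various regularization effects. Here, we consider
the normal prior. For the standard normal prior $p_0(\eta) = \mathcal{N}(0, I)$, $q(\eta)$ is a
normal with a shifted mean, i.e., $q(\eta) = \mathcal{N}(\lambda, I)$, where $\lambda = \sum_d
\sum_y \mu_d(y) \Delta \mathbf{f}(y, \bar{\phi}_d)$, and the dual problem is

$$ \max_{\mu} \; -\frac{1}{2} \Big\| \sum_{d} \sum_{y} \mu_d(y) \Delta \mathbf{f}(y, \bar{\phi}_d) \Big\|_2^2 + \sum_{d} \sum_{y} \mu_d(y) \ell_d(y) \tag{18.25} $$

$$ \text{s.t.}: \; \sum_{y} \mu_d(y) \in \Big[0, \frac{C}{D}\Big], \; \forall d. $$

The primal form of problem (18.25) is a multi-class SVM (Crammer and Singer, 2001):

$$ \min_{\lambda, \xi} \; \frac{1}{2} \|\lambda\|_2^2 + \frac{C}{D} \sum_{d=1}^{D} \xi_d \tag{18.26} $$

$$ \text{s.t.}: \; \lambda^\top \Delta \mathbf{f}(y, \bar{\phi}_d) \ge \ell_d(y) - \xi_d, \; \forall d, \forall y. $$

We denote the optimum solution by $q^*(\eta)$ and its mean by $\lambda^*$.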

2. Solve for $\phi$ and $\gamma$: By keeping $q(\eta)$ at its previous optimum solution and fixing
$(\alpha, \beta)$, we have the subproblem as solving

$$ \min_{\phi, \gamma} \; \mathcal{L}_0(q(\Theta, \mathbf{Z}), \alpha, \beta) + \frac{C}{D} \sum_{d} \max_{y \in \mathcal{Y}} \big( \ell_d(y) - (\lambda^*)^\top \Delta \mathbf{f}(y, \bar{\phi}_d) \big). \tag{18.27} $$

Since $q$ is fully factorized, we can perform the optimization on each document separately. We
observe that the constraints in MedLDA are not dependent on $\gamma$ and $q(\eta)$ is also not
directly connected with $\gamma$. Thus, optimizing the objective with respect to $\gamma_d$ leads
to the same update rule as in LDA:

$$ \gamma_d \leftarrow \alpha + \sum_{n=1}^{N} \phi_{dn}. \tag{18.28} $$


For $\phi$, the constraints do affect its solution. Although in theory we can solve this subproblem
using Lagrangian dual methods, it would be hard to derive the dual objective function (if
possible at all). Here, we choose to update $\phi$ using sub-gradient methods. Specifically, let
$g(\phi, \gamma)$ be the objective function of problem (18.27). The sub-gradient is

$$ \frac{\partial g(\phi, \gamma)}{\partial \phi_{dn}} = \frac{\partial \mathcal{L}_0}{\partial \phi_{dn}} + \frac{C}{ND} \big( \lambda^*_{\hat{y}_d} - \lambda^*_{y_d} \big), \tag{18.29} $$

where $\hat{y}_d = \arg\max_y \big( \ell_d(y) + (\lambda^*)^\top \mathbf{f}(y, \bar{\phi}_d) \big)$
is the loss-augmented prediction. By setting the sub-gradient equal to zero, we can get

$$ \phi_{dn} \propto \exp\Big( E[\log \theta_d \mid \gamma_d] + \log p(w_{dn} \mid \beta) + \frac{C}{ND} \big( \lambda^*_{y_d} - \lambda^*_{\hat{y}_d} \big) \Big), \tag{18.30} $$

where the $k$th element of $\log p(w_{dn} \mid \beta)$ is $\log \beta_{k, w_{dn}}$.

We can see that the first two terms in Equation (18.30) are the same as in unsupervised
LDA (Blei et al., 2003), and the last term is due to the max-margin formulation of MedLDA and
reflects our intuition that the discovered latent topical representation is influenced by the
margin constraints. Specifically, for those examples that are misclassified (i.e., $y_d \ne
\hat{y}_d$), the last term will not be zero, and it acts as a regularization term that biases the
model towards discovering latent representations that tend to make more accurate predictions on
these difficult examples. Moreover, this term is fixed for words in the document and thus will
directly affect the latent representation of the document (i.e., $\gamma_d$) and therefore leads to
a discriminative latent representation. As we shall see in Section 18.4, such an estimate is more
suitable for the classification task: for instance, MedLDA needs many fewer support vectors than
the max-margin classifiers that are built on raw text or the topical representations discovered by
LDA.

3. Solve for $\alpha$ and $\beta$: The last substep is to solve for $(\alpha, \beta)$ with $q(\eta)$ and
$q(\Theta, Z)$ fixed. This subproblem is the same as the problem of estimating $(\alpha, \beta)$ in
LDA, since the constraints do not directly act on $(\alpha, \beta)$. Therefore, we have the same
update rules:

$$\beta_{kw} \propto \sum_{d} \sum_{n} I(w_{dn} = w) \, \phi_{dn}^{k}, \qquad (18.31)$$

where $I(\cdot)$ is an indicator function that equals 1 if the condition holds and 0 otherwise. For
$\alpha$, the same gradient descent algorithm as in Blei et al. (2003) can be applied to find a
numerical solution.
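As a rough illustration of steps 2 and 3, the following Python sketch runs the updates for a single document. It is a minimal sketch under stated assumptions, not the released implementation. It assumes that setting the sub-gradient (18.29) to zero gives $\phi_{dn} \propto \exp\left( E[\log \theta_d \mid \gamma_d] + \log \beta_{\cdot, w_{dn}} + \frac{C}{ND} (\lambda^*_{y_d} - \lambda^*_{\bar{y}_d}) \right)$, that the cost is $\ell \, I(y \neq y_d)$, and that `lam[y]` stores the block of $\lambda^*$ belonging to class $y$. All function names and the toy numbers are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series

def loss_augmented_label(phi_bar, y_true, lam, ell=1.0):
    """Return argmax_y [ ell * I(y != y_true) + lam[y] . phi_bar ]."""
    def score(y):
        cost = ell if y != y_true else 0.0
        return cost + sum(l * p for l, p in zip(lam[y], phi_bar))
    return max(range(len(lam)), key=score)

def update_document(words, y_true, alpha, beta, lam, C, D, ell=1.0, n_iter=20):
    """Alternate the gamma update (18.28) and the assumed phi update."""
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]
    for _ in range(n_iter):
        # (18.28): gamma_d = alpha + sum_n phi_dn
        gamma = [alpha[k] + sum(row[k] for row in phi) for k in range(K)]
        # E[log theta_k] up to a constant shared by all topics
        e_log_theta = [digamma(g) for g in gamma]
        phi_bar = [sum(row[k] for row in phi) / N for k in range(K)]
        y_bar = loss_augmented_label(phi_bar, y_true, lam, ell)
        # Margin term: identical for every word in the document
        margin = [C / (N * D) * (lam[y_true][k] - lam[y_bar][k]) for k in range(K)]
        for n, w in enumerate(words):
            logits = [e_log_theta[k] + math.log(beta[k][w]) + margin[k] for k in range(K)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi[n] = [v / total for v in weights]
    gamma = [alpha[k] + sum(row[k] for row in phi) for k in range(K)]
    return gamma, phi

def update_beta(docs, phis, K, V, eps=1e-12):
    """(18.31): beta[k][w] proportional to sum_d sum_n I(w_dn = w) * phi_dn[k]."""
    counts = [[eps] * V for _ in range(K)]
    for words, phi in zip(docs, phis):
        for n, w in enumerate(words):
            for k in range(K):
                counts[k][w] += phi[n][k]
    return [[c / sum(row) for c in row] for row in counts]

# Toy problem: 2 topics, 3 word types, 2 classes.
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
lam = [[1.0, -1.0], [-1.0, 1.0]]
gamma, phi = update_document([0, 1, 2], 0, [0.5, 0.5], beta, lam, C=1.0, D=1)
new_beta = update_beta([[0, 1, 2]], [phi], K=2, V=3)
```

When $\bar{y}_d = y_d$ the margin term vanishes and the loop reduces to the usual LDA mean-field updates, which matches the discussion of Equation (18.30) above.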

The above formulation of MedLDA has a slack variable associated with each document. This
is known as the n-slack formulation (Joachims et al., 2009). Another equivalent formulation, which
can be solved more efficiently, is the so-called 1-slack formulation. The 1-slack MedLDA can be
written as follows:

$$\min_{q(\eta, \Theta, Z), \alpha, \beta, \xi} \; \mathcal{L}(q(\eta, \Theta, Z), \alpha, \beta) + C\xi \qquad (18.32)$$

$$\text{s.t.:} \quad \frac{1}{D} \sum_{d} E[\eta^\top \Delta f_d(\bar{y}_d)] \geq \frac{1}{D} \sum_{d} \Delta\ell_d(\bar{y}_d) - \xi, \quad \forall (\bar{y}_1, \ldots, \bar{y}_D).$$

By using the above alternating minimization algorithm and the cutting plane algorithm for solving
the 1-slack as well as n-slack multi-class SVMs (Joachims et al., 2009), which is implemented in
the SVM$^{struct}$ package, 7 we can solve the 1-slack or n-slack MedLDA model efficiently, as we
shall see in Section 18.4.3. SVM$^{struct}$ provides the solutions of the primal parameters $\lambda$ as well as
the dual parameters $\mu$, which are needed to do inference.
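The equivalence of the two formulations can be checked numerically on a toy problem. In the sketch below, `h[d][y]` stands for the constraint violation $\Delta\ell_d(y) - E[\eta^\top \Delta f_d(y)]$ of document $d$ under label $y$ (with value 0 at the true label); the numbers and function names are made up for illustration. The 1-slack penalty is computed by brute-force enumeration of all joint labelings, which is only feasible for tiny examples and is not how a cutting-plane solver proceeds.

```python
from itertools import product

def n_slack_penalty(h, C):
    """(C / D) * sum_d xi_d, where xi_d = max_y h[d][y]."""
    D = len(h)
    return C / D * sum(max(row) for row in h)

def one_slack_penalty(h, C):
    """C * xi, where xi = max over joint labelings of (1/D) * sum_d h[d][ybar_d]."""
    D = len(h)
    num_labels = len(h[0])
    xi = max(
        sum(h[d][y] for d, y in enumerate(labeling)) / D
        for labeling in product(range(num_labels), repeat=D)
    )
    return C * xi

# Three documents, three labels; each row has a 0 at the true label.
h = [[0.0, 0.5, -1.0],
     [-2.0, 0.0, -0.3],
     [1.2, 0.7, 0.0]]
penalty_n = n_slack_penalty(h, C=2.0)
penalty_1 = one_slack_penalty(h, C=2.0)
```

The two penalties agree because the maximum of a sum of per-document terms is attained by maximizing each term separately.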


18.4 Experiments 

In this section, we provide qualitative as well as quantitative evaluation of MedLDA on topic
estimation and document classification. For MedLDA and other topic models (except DiscLDA, whose
implementation details are explained in Footnote 12), we optimize the $K$-dimensional Dirichlet
parameters $\alpha$ using the Newton-Raphson method (Blei et al., 2003). For initialization, we set $\phi$ to
be uniform and each topic $\beta_k$ to be a uniform distribution plus a very small random noise; we set
the posterior mean of $\eta$ to be zero. We have released our implementation for public use. 8 In all the
experimental results, we also report the standard deviation for a topic model with five randomly
initialized runs.

7 See http://svmlight.joachims.org/svm_multiclass.html.

18.4.1 Topic Estimation 

We begin with an empirical assessment of topic estimation by MedLDA on the 20 Newsgroups
dataset with a standard list of stopwords 9 removed. The dataset contains about 20,000 postings in
20 related categories. We compare MedLDA with unsupervised LDA. 10 We fit a 110-topic
MedLDA model, which exploits the supervised category information, and a 110-topic unsupervised
LDA model, which ignores category information, to the dataset.

Figure 18.2 shows the 2D embedding of the inferred topic proportions $\theta$ by MedLDA and LDA
using the t-SNE stochastic neighborhood embedding method (van der Maaten and Hinton, 2008),
where each dot represents a document and each color-shape pair represents a category. Visually, the
max-margin based MedLDA produces a good separation of the documents in different categories,
while LDA does not produce a well-separated embedding, and documents in different categories
tend to mix together. This is consistent with our expectation that MedLDA could produce a strong
connection between latent topics and categories by doing supervised learning, while LDA ignores
supervision and thus builds a weaker connection. Intuitively, a well-separated representation is more
discriminative for document categorization. This is further empirically supported in Section 18.4.2.
Note that a similar embedding was presented in Lacoste-Julien et al. (2008), where the transformation
matrix in their model is pre-designed. In contrast, the MedLDA representation in Figure 18.2 is
learned automatically.

It is also interesting to examine the discovered topics and their relevance to class labels. In
Figure 18.3a we show the top topics in four example categories as discovered by both MedLDA
and LDA. Here, the semantic meaning of each topic is represented by its ten highest-probability
words.

To visually illustrate the discriminative power of the latent representations, i.e., the topic
proportion vector $\theta$ of documents, we compare the per-class distribution over topics for each
model at the right side of Figure 18.3a. This distribution is computed by averaging the expected
topic vector of the documents in each class. We can see that MedLDA yields sharper, sparser, and
fast-decaying per-class distributions over topics. For the documents in different categories, we can
see that their per-class average distributions over topics are very different, which suggests that the
topical representations by MedLDA have good discriminative power. Also, the sharper and sparser
representations by MedLDA can result in a simpler max-margin classifier (e.g., with fewer support
vectors), as we shall see in Section 18.4.2. All these observations suggest that the topical
representations discovered by MedLDA have better discriminative power and are more suitable for prediction
tasks (see Section 18.4.2 for prediction performance). This behavior of MedLDA is in fact due to
the regularization effect enforced over $\phi$ as shown in Equation (18.30). On the other hand, the fully
unsupervised LDA seems to discover topics that model the fine details of documents with no regard
for their discriminative power (i.e., it discovers different variations of the same topic, which results in
a flat per-class distribution over topics). For instance, in the class comp.graphics, MedLDA mainly
models documents using two salient, discriminative topics (T69 and T11), whereas LDA results in
a much flatter distribution. Moreover, in the cases where LDA and MedLDA discover comparably
the same set of topics in a given class (like politics.mideast and misc.forsale), MedLDA results in a
sharper low-dimensional representation.
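The per-class distribution described above is a plain average of document-level topic vectors within each class. A minimal Python sketch, assuming each document is given as a list of expected topic proportions and a class label (the function name and toy data are hypothetical):

```python
def per_class_topic_distribution(thetas, labels):
    """Average the expected topic vectors of the documents in each class."""
    sums, counts = {}, {}
    for theta, y in zip(thetas, labels):
        if y not in sums:
            sums[y] = [0.0] * len(theta)
            counts[y] = 0
        sums[y] = [s + t for s, t in zip(sums[y], theta)]
        counts[y] += 1
    return {y: [s / counts[y] for s in vec] for y, vec in sums.items()}

# Two documents in class "a", one in class "b", with two topics.
thetas = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
labels = ["a", "a", "b"]
per_class = per_class_topic_distribution(thetas, labels)
```

A sharp per-class vector, with most of its mass on one or two topics, corresponds to the behavior reported for MedLDA; a flat vector corresponds to the behavior reported for LDA.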

8 See http://www.ml-thu.net/~jun/software.shtml.

9 See http://mallet.cs.umass.edu/. 

10 We implemented LDA based on the public variational inference code by David Blei, using the same data structures as 
MedLDA for fair comparison.

FIGURE 18.2

t-SNE 2D embedding of the topical representation by MedLDA (above) and unsupervised LDA 
(below). The mapping between each index and category name can be found in: 
http://people.csail.mit.edu/jrennie/20Newsgroups/.

FIGURE 18.3

Top topics under each class as discovered by the MedLDA and LDA models (a). The average entropy
of $\theta$ over documents on 20 Newsgroups data (b).

A quantitative measure for the sparsity or sharpness of the distributions over topics is the entropy.
We compute the entropy of the inferred topic proportion for each document and take the average
over the corpus. Here, we compare MedLDA with LDA, sLDA for multi-class classification
(multi-sLDA) (Wang et al., 2009), 11 and DiscLDA (Lacoste-Julien et al., 2008). 12 For DiscLDA, as in
Lacoste-Julien et al. (2008), we fix the transformation matrix and set it to be diagonally sparse.
We use the standard training/testing split 13 to fit the models on training data and infer the topic
distributions on testing documents. Figure 18.3b shows the average entropy of different models
on testing documents when different topic numbers are chosen. For DiscLDA, we set the
class-specific topic number $K_0 = 1, 2, 3, 4, 5$ and correspondingly $K = 22, 44, 66, 88, 110$. We can see
that MedLDA yields the smallest entropy, which indicates that the probability mass is concentrated
on only a few topics, consistent with the observations in Figure 18.3a. In contrast, for LDA the
probability mass is more uniformly distributed on many topics (again consistent with Figure 18.3a),
which results in a higher entropy. For DiscLDA, although the transformation matrix is designed
to be diagonally sparse, the distributions over the class-specific topics and shared topics are flat.
Therefore, the entropy is also high. Using automatically learned transformation matrices might improve
the sparsity of DiscLDA.
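The average-entropy measure used here is straightforward to compute. The following Python sketch assumes each document is represented by its inferred topic proportion vector and uses the natural logarithm; the function names are hypothetical, and the source does not state which logarithm base was used.

```python
import math

def entropy(theta):
    """Shannon entropy (in nats) of one topic proportion vector."""
    return -sum(p * math.log(p) for p in theta if p > 0.0)

def average_entropy(thetas):
    """Mean entropy of the topic proportions over a corpus."""
    return sum(entropy(t) for t in thetas) / len(thetas)

# A fully concentrated vector has entropy 0; a uniform one has entropy log(K).
sharp = entropy([1.0, 0.0, 0.0, 0.0])
flat = entropy([0.25, 0.25, 0.25, 0.25])
corpus_avg = average_entropy([[1.0, 0.0], [0.5, 0.5]])
```

Lower values indicate that the probability mass sits on fewer topics, which is the sense in which the smallest entropy is read as the sparsest representation.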

18.4.2 Prediction Accuracy 

We perform binary and multi-class classification on the 20 Newsgroups dataset. To obtain a baseline,
we first fit an LDA model to all the data, and then use the latent representation of the training 14
documents as features to build a binary or multi-class SVM classifier. We denote this baseline as
LDA+SVM.

Binary Classification 

As in Lacoste-Julien et al. (2008), the binary classification is to distinguish postings of the
newsgroup alt.atheism from postings of the group talk.religion.misc. The training set contains 856
documents with a split of 480/376 over the two categories, and the test set contains 569 documents
with a split of 318/251 over the two categories. Therefore, the naive baseline that predicts the most
frequent category for all test documents has accuracy 318/569 ≈ 0.559.

We compare the binary MedLDA with sLDA, DiscLDA, LDA+SVM, and the standard binary
SVM built on raw text features. For supervised LDA, we use both the regression model (sLDA) (Blei
and McAuliffe, 2007) and the classification model (multi-sLDA) (Wang et al., 2009). For the sLDA
regression model, we fit it using the binary representation (0/1) of the classes, and use a threshold of 0.5
to make predictions. For MedLDA, to see whether a second-stage max-margin classifier can improve
the performance, we also build a MedLDA+SVM method analogous to LDA+SVM. For DiscLDA,
we fix the transformation matrix. Automatically learning the transformation matrix can yield slightly better
results, as reported in Lacoste-Julien (2009). All of the above methods that use class label
information are fit only on the training data.

11 We thank the authors for providing their implementation, on which we made necessary slight modifications, e.g.,
improving the time efficiency and optimizing $\alpha$.

12 DiscLDA is a conditional model that uses class-specific topics and shared topics. Since the code is not publicly available,
we implemented an in-house version by following the same strategy as in Lacoste-Julien et al. (2008) and share $K_1$ topics
across classes and allocate $K_0$ topics to each class, where $K_1 = 2K_0$, and we varied $K_0 = \{1, 2, \ldots\}$. We should
note here that Lacoste-Julien et al. (2008) and Lacoste-Julien (2009) gave an optimization algorithm for learning the topic
structure (i.e., a transformation matrix); however, since the code is not available, we resorted to one of the fixed splitting
strategies mentioned in the paper. Moreover, for the multi-class case, the authors only reported results using the same fixed
splitting strategy we mentioned above. For the number of iterations for training and inference, we followed Lacoste-Julien
(2009). Moreover, following Lacoste-Julien (2009) and personal communication with the first author, we used symmetric
Dirichlet priors on $\beta$ and $\theta$, and set the Dirichlet parameters to 0.01 and $0.1/(K_0 + K_1)$, respectively.

13 See http://people.csail.mit.edu/jrennie/20Newsgroups/. 

14 We use the training/testing split in: http://people.csail.mit.edu/jrennie/20Newsgroups/.

FIGURE 18.4

Classification accuracy of different models for: (a) binary and (b) multi-class classification on the
20 Newsgroups data.

We use SVM-light (Joachims, 1999), which provides both primal and dual parameters, to
build SVM classifiers and to estimate the posterior mean of $\eta$ in MedLDA. The parameter $C$ is
chosen via 5-fold cross-validation during training from $\{k^2 : k = 1, \ldots, 8\}$. For each model, we
run the experiments five times and take the average as the final results. The prediction accuracy of
different models with respect to the number of topics is shown in Figure 18.4(a). For DiscLDA, we
follow Lacoste-Julien et al. (2008) and set $K = 2K_0 + K_1$, where $K_0$ is the number of class-specific
topics, $K_1$ is the number of shared topics, and $K_1 = 2K_0$. Here, we set $K_0 = 1, \ldots, 8, 10$.
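The two settings above reduce to simple arithmetic, sketched below in Python. The candidate grid follows $\{k^2 : k = 1, \ldots, 8\}$ directly. The general topic-count formula (number of classes times $K_0$, plus $K_1 = 2K_0$ shared topics) is our assumption: it reproduces $K = 2K_0 + K_1$ for the binary task and the values $K = 22, 44, 66, 88, 110$ reported for 20 classes. Function names are hypothetical.

```python
def candidate_C_values(max_k=8):
    """Candidate regularization constants {k^2 : k = 1, ..., max_k}."""
    return [k * k for k in range(1, max_k + 1)]

def disclda_total_topics(k0, num_classes):
    """Total DiscLDA topics: k0 per class plus k1 = 2 * k0 shared topics."""
    k1 = 2 * k0
    return num_classes * k0 + k1

c_grid = candidate_C_values()
binary_topics = [disclda_total_topics(k0, 2) for k0 in (1, 2, 3)]
multiclass_topics = [disclda_total_topics(k0, 20) for k0 in range(1, 6)]
```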

We can see that the max-margin MedLDA outperforms the likelihood-based downstream models,
including multi-sLDA, sLDA, and LDA+SVM. The best performances of the two discriminative
models, MedLDA and DiscLDA, are comparable. However, MedLDA is easier to learn and faster
in testing, as we shall see in Section 18.4.3. Moreover, the different approximate inference
algorithms used in MedLDA (i.e., variational approximation) and DiscLDA (i.e., Monte Carlo sampling
methods) can also make the performance different. We tried collapsed variational inference
(Teh et al., 2006) for MedLDA, and it can give slightly better results. However, the collapsed
variational method is computationally more expensive. Finally, since MedLDA already integrates the
max-margin principle into its training, our conjecture is that the combination of MedLDA and SVM
does not further improve the performance much on this task. We believe that the slight differences
between MedLDA and MedLDA+SVM are due to the tuning of regularization parameters. For
efficiency, we do not change the regularization constant $C$ while training MedLDA. The performance
of MedLDA would be improved if we selected a good $C$ at each iteration, because the data
representation changes during training.

Multi-Class Classification 

We perform multi-class classification on 20 Newsgroups with all the 20 categories. The dataset 
has a balanced distribution over the categories. For the test set, which contains 7,505 documents 
in total, the smallest category has 251 documents and the largest category has 399 documents. For 
the training set, which contains 11,269 documents, the smallest and the largest categories contain 
376 and 599 documents, respectively. Therefore, the naive baseline that predicts the most frequent 
category for all the test documents has the classification accuracy 0.0532. 

We compare MedLDA with LDA+SVM, multi-sLDA, DiscLDA, and the standard multi-class
SVM built on raw text. We use the SVM$^{struct}$ package with a cost function
$\Delta\ell_d(y) \triangleq \ell \, I(y \neq y_d)$
to solve the sub-step of learning $q(\eta)$ and build the SVM classifiers for LDA+SVM. The parameter
$\ell$ is selected with 5-fold cross-validation. The average results, as well as standard deviations over
5 randomly initialized runs, are shown in Figure 18.4(b). For DiscLDA, we use the same equation
as in Lacoste-Julien et al. (2008) to set the number of topics and set $K_0 = 1, \ldots, 5$. We can see
that supervised topic models discover more predictive representations for classification, and the
discriminative max-margin MedLDA and DiscLDA perform comparably, slightly better than the
standard multi-class SVM (about 1.3 ± 0.3 percent improvement in accuracy). However, as we have
stated and will show in Section 18.4.3, MedLDA is simpler to implement and faster in testing than
DiscLDA. As we shall see shortly, MedLDA needs far fewer support vectors than the standard SVM.

Figure 18.5(a) shows the classification accuracy on the 20 Newsgroups dataset for MedLDA
with 70 topics. We show the results with $\ell$ manually set to $1, 4, 8, 12, \ldots, 32$. We can see that
although the common 0/1-cost works well for MedLDA, we can get better accuracy by using a larger
cost to penalize wrong predictions. The performance is quite stable when $\ell$ is set to be larger than 8.
The reason why $\ell$ affects the performance is that $\ell$ as well as $C$ control: 1) the scale of the posterior
mean of $\eta$ and the Lagrangian multipliers $\mu$, whose dot-product regularizes the topic mixing
proportions in Equation (18.30); and 2) the goodness-of-fit of the MED large-margin classifier on the
data. For practical reasons, we only try a small subset of candidate $C$ values in parameter search,
which can also influence the difference in performance in Figure 18.5(a). Performing a very careful
parameter search on $C$ could possibly shrink the difference. Finally, for a small $\ell$ (e.g., 1 for the
0/1-cost), we usually need a large $C$ in order to obtain good performance. However, our empirical
experience with SVM$^{struct}$ shows that the multi-class SVM with a larger $C$ (and smaller $\ell$) is typically
more expensive to train than the SVM with a larger $\ell$ (and smaller $C$). That is one reason why we
choose to use a large $\ell$.
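To make the role of the cost parameter concrete, the Python sketch below evaluates the hinge term $\max_y \left( \ell \, I(y \neq y_d) - (s_{y_d} - s_y) \right)$ for one document, where $s_y$ is the linear score of class $y$. The scores are made-up numbers and the function name is hypothetical; this only illustrates the cost function and is not part of the SVM$^{struct}$ interface.

```python
def cost_sensitive_hinge(scores, y_true, ell):
    """max_y [ ell * I(y != y_true) - (scores[y_true] - scores[y]) ]."""
    return max(
        (ell if y != y_true else 0.0) - (scores[y_true] - s)
        for y, s in enumerate(scores)
    )

# The true class 0 beats the runner-up by a margin of 1.5.
scores = [2.0, 0.5, -1.0]
loss_01 = cost_sensitive_hinge(scores, 0, ell=1.0)   # margin already exceeds 1
loss_4 = cost_sensitive_hinge(scores, 0, ell=4.0)    # margin of 4 now required
loss_8 = cost_sensitive_hinge(scores, 0, ell=8.0)
```

With the 0/1-cost the document incurs no loss, whereas larger values of $\ell$ demand a wider margin and therefore penalize the same scores.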

Figure 18.5(b) shows the number of support vectors for MedLDA, LDA+SVM, and the multi-class
SVM built on raw text features, which are high-dimensional (~60,000 dimensions for the 20
Newsgroups data) and sparse. Here we consider the traditional n-slack formulation of multi-class
SVM and n-slack MedLDA using the SVM$^{struct}$ package, where a support vector corresponds to a
document-label pair. For MedLDA and LDA+SVM, we set $K = 70$. For MedLDA, we report both
the number of support vectors at the final iteration and the average number of support vectors over
all iterations. We can see that both MedLDA and LDA+SVM generally need many fewer support
vectors than the standard SVM on raw text. The major reason is that both MedLDA and LDA+SVM
use a much lower-dimensional and more compact representation for each document. Moreover,
MedLDA needs (about 4 times) fewer support vectors than LDA+SVM. This could be because
MedLDA makes use of both text contents and the supervising class labels in the training data,
and its estimated topics tend to be more discriminative when being used to infer the latent topical
representations of documents. That is, using the latent representations by MedLDA, the documents
in different categories are more likely to be well-separated, and therefore the max-margin classifier
is simpler (i.e., needs fewer support vectors). This observation is consistent with what we have
observed on the per-class distributions over topics in Figure 18.3a. Finally, we observe that about
32% of the support vectors in MedLDA are also the support vectors in multi-class SVM on the raw
features.

FIGURE 18.5

Sensitivity to the cost parameter $\ell$ for the MedLDA (a); and the number of support vectors for n-slack
multi-class SVM, LDA+SVM, and n-slack MedLDA (b). For MedLDA, we show both the
number of support vectors at the final iteration and the average number during training.


18.4.3 Time Efficiency 

Now, we report empirical results on time efficiency in training and testing. All the following results
were obtained on a standard desktop with a 2.66GHz Intel processor. All models are implemented in
C++.

Training Time 

Figure 18.6 shows the average training time together with standard deviations on both binary and
multi-class classification tasks with 5 randomly initialized runs. Here, we do not compare with
DiscLDA because learning the transformation matrix is not fully implemented in Lacoste-Julien (2009), but
we will compare the testing time with it. From the results, we can see that for binary classification,
MedLDA is more efficient than multi-class sLDA and is comparable with LDA+SVM. Multi-class
sLDA is slow because the normalization factor in the distribution model of $y$ strongly
couples the topic assignments of different words in the same document. Therefore, the posterior
inference is slower than that of LDA and MedLDA, which uses LDA as the underlying topic model.
The sLDA regression model takes even more training time due to the mismatch between its
normal assumption and the non-Gaussian binary response variables, which prolongs the E-step.

For multi-class classification, the training time of MedLDA is mainly dependent on solving a
multi-class SVM problem. Here, we implemented both 1-slack and n-slack versions of multi-class
SVM (Joachims et al., 2009) for solving the sub-problem of estimating $q(\eta)$ and Lagrangian
multipliers in MedLDA. As we can see from Figure 18.6, the MedLDA with 1-slack SVM as the
sub-solver can be very efficient, comparable to unsupervised LDA+SVM. The MedLDA with n-slack
SVM solvers is about three times slower. Similar to the binary case, for the multi-class supervised
sLDA (Wang et al., 2009), because of the normalization factor in the category probability model
(i.e., a softmax function), the posterior inference on different topic assignment variables (in the
same document) is strongly correlated. Therefore, the inference is about ten times slower than that
on LDA and MedLDA, which takes LDA as the underlying topic model.

FIGURE 18.6

Training time (CPU seconds in log-scale) of different models for both binary (left) and multi-class
classification (right).

We also show the time spent on inference and the ratio it takes over the total training time for
different models in Figure 18.7(a). We can clearly see that the difference between 1-slack MedLDA
and n-slack MedLDA lies in the learning of SVMs. Both methods have similar inference time. We
can also see that for LDA+SVM and multi-sLDA, more than 95% of the training time is spent on
inference, which is very expensive for multi-sLDA. Note that LDA+SVM takes a longer inference
time than MedLDA because we use more data (both training and testing) to learn unsupervised
LDA.



FIGURE 18.7

The inference time and total training time for learning different models, as well as the ratio of
inference time over total training time (a). For MedLDA, we consider both the 1-slack and n-slack
formulations; for LDA+SVM, the SVM classifier is the fast 1-slack formulation. Testing
time of different models with respect to the number of topics for multi-class classification (b).

Testing Time 

Figure 18.7(b) shows the average testing time with standard deviation on the 20 Newsgroups testing
data with five randomly initialized runs. We can see that MedLDA, multi-class sLDA, and
unsupervised LDA are comparable in testing time, and all are faster than DiscLDA. This is because all three
models (MedLDA, multi-class sLDA, and LDA) are downstream models (see the Introduction for the
definition). In testing, they do exactly the same tasks, i.e., inferring the overall latent topical
representation and doing prediction with a linear model. Therefore, they have comparable testing time.
However, DiscLDA is an upstream model, for which the inference to find the category-dependent
latent topic representations is done multiple times. Therefore, in principle, the testing time of an
upstream topic model is about $|\mathcal{C}|$ times slower than that of its downstream counterpart model, where
$\mathcal{C}$ is the finite set of categories. The results in Figure 18.7(b) show that DiscLDA is roughly twenty
times slower than the downstream models. Of course, the different inference algorithms can also
make the testing time different.

18.5 Conclusions and Discussions

We have presented maximum entropy discrimination LDA (MedLDA), a supervised topic model 
that uses the discriminative max-margin principle to estimate model parameters such as topic dis- 
tributions underlying a corpus, and infer latent topical vectors of documents. MedLDA integrates 
the max-margin principle into the process of topic learning and inference via optimizing one single 
objective function with a set of expected margin constraints. The objective function is a tradeoff 
between the goodness-of-fit of an underlying topic model and the prediction accuracy of the re- 
sultant topic vectors in a max-margin classifier. We provide empirical evidence which appears to 
demonstrate that this integration could yield predictive topical representations that are suitable for 
prediction tasks, such as classification. Our results demonstrate that MedLDA is an attractive super- 
vised topic model, which can achieve state-of-the-art performance for topic discovery and prediction 
accuracy while needing fewer support vectors than competing max-margin methods that are built 
on raw text or the topical representations discovered by unsupervised LDA. 

The results of prediction accuracy on the 20 Newsgroups dataset show that MedLDA works 
slightly better than the SVM classifiers built on raw input features. These slight improvements tend 
to raise the question, “When and why should we choose MedLDA?” We have two possible answers: 

1. MedLDA is a topic model. Besides predicting on unseen data, MedLDA can discover semantic
patterns underlying complex data. In contrast, SVM models are more like black-box machines
that take raw input features and find good decision boundaries or regression curves, but
are incapable of discovering or considering hidden structures of complex data. 15 As an extension
of SVM, MedLDA performs both exploratory analysis (i.e., topic discovery) and predictive tasks
(e.g., classification) simultaneously. So, the first selection rule is that if we want to disclose some
underlying patterns besides doing prediction, MedLDA should be preferred to SVM.

2. Even if our goal is prediction performance, MedLDA should also be considered as a competitive
alternative. As shown in the synthetic experiments (Zhu et al., 2012) as well as the follow-up
work (Yang et al., 2010; Wang and Mori, 2011; Li et al., 2011), depending on the data and
problems, max-margin supervised topic models can outperform SVM models, or at least
are comparable if no gains are obtained. One reason for our current results on 20
Newsgroups is that the fully factorized mean field assumption could be too restrictive and lead
to inaccurate estimates. In fact, we have tried more sophisticated inference methods such as
collapsed variational inference (Teh et al., 2006) and collapsed Gibbs sampling, 16 both of which
could lead to superior prediction performance.

Finally, MedLDA presents one of the first successful attempts, in the context of Bayesian mixed
membership models (or topic models in particular), towards pushing forward the interface
between max-margin learning and Bayesian generative modeling. As further demonstrated in others’
work (Yang et al., 2010; Wang and Mori, 2011; Li et al., 2011) as well as our recent work on
regularized Bayesian inference (Chen et al., 2012; Zhu et al., 2011a;b; Zhu, 2012; Xu et al., 2012), the
max-margin principle could be a fruitful addition to “regularize” the desired posterior distributions
of Bayesian models for performing better prediction in a broad range of scenarios, such as image
annotation/classification, multi-task learning, social link prediction, low-rank matrix factorization,
etc. Of course, the flexibility of performing max-margin learning brings new challenges. For
example, the learning and inference problems of such models need to deal with some non-smooth loss
functions (e.g., the hinge loss in MedLDA), for which developing efficient algorithms for large-scale
applications is a challenging research problem. Moreover, although we have a good theoretical
understanding of the generalization ability of max-margin methods without latent variables (e.g., SVMs),
it is a challenging problem to provide theoretical guarantees for the generalization performance of
max-margin models with latent variables.

15 Some strategies like sparse feature selection can be incorporated to make an SVM more interpretable in the original
feature space, but this is beyond the scope of this discussion.

16 Sampling methods for MedLDA can be developed by using Lagrangian dual methods. Details are reported in Jiang et al.
(2012).

Population stratification (or population structure) is the presence of genotypic differences among
different groups of individuals. Genotype-based clustering of individuals is an important way of 
summarizing the genetic similarities and differences between groups of individuals. It enables us to 
formulate and test hypotheses regarding evolutionary history of populations. Identifying population 
stratification is also important in genetic association studies and other population genetic analyses. 

We present blockmStruct, a mixed membership model for identifying population stratification 
from single nucleotide polymorphism (SNP) data. Our model incorporates mutations in SNP haplo- 
type blocks in the ancestry inference. We demonstrate using simulation data that mStruct recovers 
ancestry more accurately than similar methods without mutation models. We analyze SNP data from 
597 individuals from The Human Genome Diversity Project and the HapMap project and show that 
the recovered population structure recapitulates geographic patterns of diversity.

19.1 Introduction

Due to the continuing improvements in sequencing technologies and their falling costs, a num- 
ber of genomic datasets such as HapMap (Gibbs, 2003) and the Human Genome Diversity Project 
(HGDP) (Cavalli-Sforza, 2005) are now available for study. These datasets contain individuals from 
various ethno-linguistic groups and regions across the world. An important task, therefore, is to 
characterize the genetic variation present in a given sample of individuals. Genotype clustering is 
one way of identifying the population stratification in a given sample of individuals. It provides a 
summarization of individuals based on genetic similarity and differences that can be interpreted and 
visualized easily. We can use the resulting summary to propose and test hypotheses about the evolu- 
tionary history of populations. Detecting population stratification present in a sample of individuals 
is also essential for reducing false positives in genetic association studies. 

A number of evolutionary processes such as mutation, recombination, selection, admixture, mi- 
grations, expansions, and bottlenecks affect the ancestry and genomes of a group of individuals. 
These processes result in genomes which have contributions from more than one ancestral popula- 
tion. Different parts of an individual’s genome can be inherited from ancestors of different popula- 
tions. The Structure model by Pritchard et al. (2000) was one of the early attempts to address the 
problem of clustering individuals while allowing partial membership in multiple ancestral popula- 
tions. It used a mixed membership model to determine the fractional contributions from multiple 
ancestral populations to an individual genome. Various extensions to the underlying model that
account for other evolutionary processes such as mutation (Shringarpure and Xing, 2009) and
recombination (Falush et al., 2003) have also been proposed. Figure 19.1 shows the representation of
ancestry vectors for a set of individuals from the HapMap dataset.

We consider here a mixed membership model that takes into account mutations during inheri- 
tance from ancestral populations to modern populations. Our model is a modification of the mStruct 
model proposed in Shringarpure and Xing (2009). It considers single nucleotide polymorphisms 
(SNPs) to occur in unlinked haplotype blocks and hypothesizes that mutations occur within the 
SNP blocks during inheritance. We validate the model on simulated data and show how model- 
ing mutations in haplotype blocks can allow us to accurately recover ancestry when mutations are 
present. We show results on data from 597 individuals in the Human Genome Diversity Project 
(HGDP) at 10,000 SNPs on chromosome 1. 



FIGURE 19.1 

Analysis of individuals from the HAPMAP dataset assuming K = 3 ancestral populations. The 
mixed membership vector underlying each individual is represented as a thin vertical line of unit 
length and multiple colors, with the height of each color reflecting the fraction of the individual’s 
genome originating from a certain ancestral population denoted by that color and formally
represented by APs.


19.2 Related Work

A number of approaches have been proposed for detecting population stratification from genomic 
data. One class of methods uses low-dimensional projections and eigenanalysis to cluster individ- 
uals (Patterson et al., 2006). These methods do not assume any specific evolutionary model for the 
genomic data. Their advantages are their efficiency and the ability to describe the statistical signifi- 
cance of the stratification produced. 

Another class of methods for population stratification assumes an explicit evolutionary model 
for the genomic data. These model-based stratification methods, starting with the Structure model 
by Pritchard et al. (2000), have become very popular due to their interpretability. The Structure 
method uses a hierarchical Bayesian model to capture the effect of admixture on modern genomes. 
The underlying framework for these methods is the mixed membership model of Erosheva et al. 
(2004). Such a model postulates that a genome, or the ensemble of genetic markers of an individual, 
is made up of independently and identically distributed (iid) samples (Pritchard et al., 2000) from 
multiple population-specific multinomial distributions (known as allele frequency profiles, or AP) 
of marker alleles. The mixed membership model represents each ancestral population by a specific 
AP which defines a unique vector of allele frequencies of each marker in each ancestral population. 
The fraction of contributions from each AP in a modern individual genome is represented as an 
admixing vector (also known as an ancestral proportion vector or structure vector) in a structural 
map over the population sample. 

The model parameters are inferred using Markov Chain Monte Carlo (MCMC) sampling. The 
drawback of the Structure method is that it is slow compared to eigenanalysis and cannot be used 
efficiently for large datasets. Recently, some methods such as ADMIXTURE (Alexander et al., 
2009) and Frappe (Tang et al., 2005) have proposed computational improvements to Structure using 
faster optimization methods for learning the Structure model parameters. 

Various extensions to Structure have been proposed to account for evolutionary processes such 
as mutation (Shringarpure and Xing, 2009) and recombination (Falush et al., 2003). Shringarpure 
and Xing (2009) extend the Structure model by allowing allele mutations in the assumed evolu- 
tionary model. Their results on microsatellite data from the 52 populations in the HGDP show 
that modeling allele mutations affects the ancestry inference and the inferred ancestry proportions 
(mixed memberships) for the individuals. They also present results on the accumulated mutation 
among the individuals relative to the inferred ancestral populations. However, the mStruct model 
fails to extract any mutation information from the HGDP SNP data, possibly due to a simplistic 
mutation model for SNP data. 

In the following sections, we present a modification of the mStruct model, which we call
blockmStruct, for the analysis of SNP data. Our model assumes that a modern genome is composed of SNP
haplotype blocks that are not linked to each other and that mutations occur within a haplotype
block. We will first introduce the Structure model, present the mStruct model as an improvement to
Structure, and then develop blockmStruct as a modification that can analyze SNP data and account
for linkage disequilibrium.


19.3 The Structure Model 

The Structure model by Pritchard et al. (2000) represents one of the earliest uses of mixed membership
models in the context of modeling genetic data. It assumes that genomes of modern individuals
are composed of a mixture of ancestral populations. The details of the Structure model can be
understood by examining the choice of representation of individuals, ancestral populations, and the
underlying generative process.

19.3.1 Representation of Individuals 

For the following discussion, we will assume that the genomic data for each of the $N$ individuals is
a set of $I$ loci, and that at the $i$th locus the alleles observed are represented as
$\{a_{i,1}, \dots, a_{i,L_i}\}$ (therefore $L_i$ denotes the number of alleles observed at locus $i$).
We consider the case of diploid human data, i.e., each chromosome has two copies. Therefore, the
$e$th copy ($e \in \{1, 2\}$) of the $i$th locus in the $n$th individual can be represented as
$x_{i,n_e} \in \{a_{i,1}, \dots, a_{i,L_i}\}$. This representation can be used for all
polymorphic markers, for instance, microsatellites (repeats of a 6-8 base pair DNA unit, represented
as integer counts) and SNPs (single nucleotide polymorphisms, represented as 0/1).

The Structure model assumes that all the loci in an individual’s genome are independent
of each other. The genome for the $n$th individual is therefore given by the set of alleles
$x_n = \{x_{i,n_e} : i = 1, \dots, I;\ e = 1, 2\}$.

19.3.2 Representation of Ancestral Populations 

An intuitive representation for characterizing the allelic diversity observed at a polymorphic locus is
in terms of its allele frequencies. These (multinomial) allele frequency distributions are called
allele frequency profiles (or APs) (Falush et al., 2003). The Structure model represents ancestral
populations as a collection of allele frequency profiles, one per locus. We can represent an ancestral
population $k$ by a unique set of population-specific multinomial distributions
$\beta^k = \{\beta_i^k,\ i = 1, \dots, I\}$, where $\beta_i^k = [\beta_{i,1}^k, \dots, \beta_{i,L_i}^k]$
is the vector of multinomial parameters, also known as an AP (Falush et al., 2003), of the allele
distribution at locus $i$ in ancestral population $k$; $L_i$ denotes the total number
of observed marker alleles at locus $i$; and $I$ denotes the total number of marker loci. This
representation, known as population-specific allele-frequency profiles, is used by the program Structure.

Under an AP, the probability of an allele $x_i$ at locus $i$ given its ancestral population of origin $k$ is
given by

$$P(x_i \mid \beta_i^k) = \sum_{l=1}^{L_i} \mathbb{I}[x_i = a_{i,l}]\, \beta_{i,l}^k, \qquad (19.1)$$

where $\mathbb{I}[\cdot]$ is the indicator function, which takes value 1 when the included condition is
true and 0 otherwise.

In the case of SNPs, the multinomial distribution can be reduced to a Bernoulli distribution since
there are only two alleles. Then we can represent $\beta_i^k$ as the parameter of the Bernoulli distribution
for the $k$th ancestral population at the $i$th SNP.
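
As an illustration, Equation (19.1) and its Bernoulli special case can be sketched in a few lines of Python. This is a minimal sketch, not code from Structure: the function names are hypothetical, and the convention that the Bernoulli parameter is the probability of the allele coded 1 is an assumption made here for concreteness.

```python
def ap_allele_prob(x, alleles, freqs):
    """Equation (19.1): sum over l of I[x == a_l] * beta_l for one locus and population."""
    if len(alleles) != len(freqs):
        raise ValueError("alleles and freqs must have the same length")
    return sum(f for a, f in zip(alleles, freqs) if a == x)


def snp_allele_prob(x, beta):
    """Bernoulli special case for a 0/1 SNP; beta is ASSUMED to be P(x = 1)."""
    if x not in (0, 1):
        raise ValueError("SNP alleles are coded as 0 or 1")
    return beta if x == 1 else 1.0 - beta
```

An allele that was never observed in the profile receives probability zero under an AP, which is the limitation that the mutation-aware representations later in the chapter are designed to relax.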

19.3.3 Generative Process 

In a general mixed membership model, the $n$th individual is represented by a mixed membership
vector (or ancestry vector) $\lambda_n = \{\lambda_{n,1}, \dots, \lambda_{n,K}\}$ that represents the
individual-specific fractional contributions of the $K$ ancestral populations to the genome. For every
individual, the alleles at all loci may be inherited from founders in different ancestral populations,
each represented by a unique distribution of founding alleles and the way they can be inherited.
Formally, this scenario can be captured in the following generative process:

1. For each individual $n$, draw the mixed membership vector (or ancestry vector) $\lambda_n \sim P(\cdot \mid \alpha)$,
where $P(\cdot \mid \alpha)$ is a pre-chosen structure prior.

2. For each marker allele $x_{i,n_e} \in x_n$:

(a) Draw the latent ancestral-population-origin indicator $z_{i,n_e} \sim \text{Multinomial}(\cdot \mid \lambda_n)$.

(b) Draw the allele $x_{i,n_e} \mid z_{i,n_e} = k \sim P_k(\cdot \mid \Theta_i^k)$, where $\Theta_i^k$ denotes the
parameters of ancestral population $k$ at locus $i$.

In Structure, the ancestral populations are represented by a set of population-specific APs. Thus
the distribution $P_k(\cdot \mid \Theta_i^k)$ from which an observed allele can be sampled is a multinomial
distribution defined by the frequencies of all observed alleles in the ancestral population, i.e.,
$x_{i,n_e} \mid z_{i,n_e} = k \sim \text{Multinomial}(\cdot \mid \beta_i^k)$. Figure 19.2 shows the graphical
model representation of Structure.
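
The generative process above can be simulated directly. The following Python sketch is illustrative only and is not the Structure program: it assumes a Dirichlet structure prior (the text only requires a pre-chosen prior), and the function names and the nested-list layout of the allele frequency profiles are hypothetical choices made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_structure_genome(alpha, beta, rng):
    """Simulate one diploid individual under the Structure generative process.

    alpha: Dirichlet hyperparameters of length K (ASSUMED form of the structure prior).
    beta:  beta[k][i] is the allele frequency profile of population k at locus i.
    Returns (ancestry_vector, genome); genome[i] holds the two allele indices at locus i.
    """
    num_pops = len(alpha)
    num_loci = len(beta[0])
    lam = sample_dirichlet(alpha, rng)  # step 1: ancestry vector
    genome = []
    for i in range(num_loci):
        copies = []
        for _ in range(2):  # two copies of each locus (diploid data)
            # step 2(a): population of origin for this allele copy
            k = rng.choices(range(num_pops), weights=lam)[0]
            # step 2(b): allele drawn from the AP of population k at locus i
            allele = rng.choices(range(len(beta[k][i])), weights=beta[k][i])[0]
            copies.append(allele)
        genome.append(tuple(copies))
    return lam, genome
```

Because each allele copy draws its own population indicator, a single individual can carry alleles from several ancestral populations, which is what distinguishes this mixed membership model from a plain mixture model.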



FIGURE 19.2 

Graphical model representation of Structure. For convenience, we have ignored the diploid nature 
of the observation. The shaded node indicates the variables we observe. 

This model has been generalized to allow linked loci and correlated allele frequencies (Falush 
et al., 2003). It has been successfully applied to human genetic data in Rosenberg et al. (2002). 
Figure 19.3 shows the ancestry vector representation for 1,048 individuals from 53 groups in the
Human Genome Diversity Project (HGDP) (Rosenberg et al., 2002). From the figure, we can see 
that ancestry vectors for individuals within a continent are more similar to each other than ancestry 
vectors for individuals in different continents. 


19.4 The mStruct Model 

The Structure model provides a method for inferring the ancestry of individuals as contributions 
from multiple (hypothetical) ancestral populations represented as APs. But a serious pitfall of using 
such a model is that there is no mutation model for individual alleles with respect to the common 
prototypes, i.e., every unique allele measurement at a particular locus is assumed to correspond to 
a unique ancestral allele, rather than allowing the possibility of it just being derived from some 
common ancestral allele at that locus as a result of a mutation (Excoffier and Hamilton, 2003). The 
mStruct model was proposed as an extension of the Structure model to account for the possibility of 
allele mutations (Shringarpure and Xing, 2009). We will present the mStruct model in terms of its
differences with respect to the Structure model, which lie in the representation of ancestral
populations and the generative process.

FIGURE 19.3

Ancestry vectors for 1,048 individuals from the Human Genome Diversity Project with K ranging 
from 2 to 6. 

19.4.1 Representation of Ancestral Populations 

An AP does not enable us to model the possibility of mutations, i.e., there is no way of representing
a situation where two observed alleles might have been derived from a single ancestral allele by two
different mutations. This possibility can be represented by a genetically more realistic statistical
model known as the population-specific mixture of ancestral alleles (MAA). For each locus $i$, an
MAA for ancestral population $k$ is a set $\Theta_i^k = \{\mu_i^k, \delta_i^k, \beta_i^k\}$ consisting of
three components: 1) a set of ancestral (or founder) alleles
$\mu_i^k = \{\mu_{i,1}^k, \dots, \mu_{i,L'_i}^k\}$, which can differ from their descendant
alleles in the modern population; 2) a mutation parameter $\delta_i^k$ associated with the locus, which can
be further generalized to be allele-specific if necessary; and 3) an AP $\beta_i^k$, which now represents the
frequencies of the ancestral alleles. Here $L'_i$ denotes the total number of ancestral alleles at locus $i$,
which is different from $L_i$ in the previous subsection, which denotes the total number of observed
alleles at locus $i$. By explicitly associating a mutation model with an ancestral population, we can now
capture mutation events as described above. It is important to note that the mutation parameter $\delta$ is
not the mutation rate commonly referred to in the literature. As we shall see later, it is a measure
of the variability of a locus, which can be described approximately as the combined effect of the
per-generation mutation rate and the age of the population.

An MAA is more expressive than an AP, because the incorporation of a mutation model helps 
to capture details about the population structure which an AP cannot; and the MAA reduces to the 
AP when the mutation rates (and hence the mutation parameters) become zero and the founders
are identical to their descendants. MAA is also arguably more realistic because it allows mutation
rates (and mutation parameters) to be different for different founder alleles, even within the same 
ancestral population, as is commonly the case with many genetic markers. For example, the mutation 
rates for microsatellite alleles are believed to be dependent on their length (number of repeats). As 
we shall show shortly, with an MAA, one can examine the mutation parameters corresponding to 
each ancestral population via Bayesian inference from genotype data; this might enable us to infer 
the age of alleles and also estimate population divergence times subject to a calibration constant. 

Under an MAA specific to an ancestral population $k$, the correspondence between a marker
allele $x_{i,n_e}$ and a founder $\mu_{i,l}^k \in \mu_i^k$ is not directly observable. With each allele founder
$\mu_{i,l}^k$ we associate an inheritance model $P(\cdot \mid \mu_{i,l}^k, \delta_{i,l}^k)$ from which
descendants can be sampled. Then, given specifications of the ancestral population from which $x_{i,n_e}$ is
derived, which is denoted by the hidden indicator variable $z_{i,n_e}$, the conditional distribution of
$x_{i,n_e}$ under MAA follows a mixture:

$$P(x_{i,n_e} = a_{i,l'} \mid z_{i,n_e} = k) = \sum_{l=1}^{L'_i} \beta_{i,l}^k\, P(x_{i,n_e} = a_{i,l'} \mid \mu_{i,l}^k, \delta_{i,l}^k). \qquad (19.2)$$

Comparing this to the counterpart of this function under an AP,
$P(x_{i,n_e} = a_{i,l} \mid z_{i,n_e} = k) = \beta_{i,l}^k$, we
can see that the latter cannot explicitly model allele diversities in terms of molecular evolution from
the founders.
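
To make the mixture in Equation (19.2) concrete, the following Python sketch evaluates it for a single locus and population. It is a minimal illustration rather than the mStruct implementation: the function names are hypothetical, and the inheritance model is passed in as a callable so that any mutation model can be plugged in. The degenerate `no_mutation` model is included only to show that the mixture collapses to the AP probability when founders are copied without change.

```python
def maa_allele_prob(allele, founders, freqs, deltas, mutation_model):
    """Equation (19.2): sum over founders l of beta_l * P(allele | mu_l, delta_l)."""
    if not (len(founders) == len(freqs) == len(deltas)):
        raise ValueError("founders, freqs and deltas must have the same length")
    return sum(
        w * mutation_model(allele, mu, d)
        for mu, w, d in zip(founders, freqs, deltas)
    )


def no_mutation(allele, founder, delta):
    """Degenerate inheritance model: the descendant always equals its founder."""
    return 1.0 if allele == founder else 0.0
```

With `no_mutation`, an observed allele equal to a founder receives exactly that founder's frequency and every other allele receives zero, which mirrors the AP counterpart noted above.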

19.4.2 Generative Process 

Recall that in an MAA for each locus we define a finite set of founders with prototypical alleles
$\mu_i^k = \{\mu_{i,1}^k, \dots, \mu_{i,L'_i}^k\}$ that can be different from the alleles observed in a
modern population; each founder is associated with a unique frequency and a unique (if desired)
mutation model from the prototype allele parameterized by rate $\delta_{i,l}^k$. Under this representation,
the distribution $P_k(\cdot \mid \Theta_i^k)$ from which an observed allele can be sampled becomes a mixture
of inheritance models, each defined on a specific founder; and the ensuing sampling module that can be
plugged into the general admixture scheme outlined earlier (to replace step 2(b)) becomes a two-step
generative process:

1. Draw the latent founder indicator $c_{i,n_e} \mid z_{i,n_e} = k \sim \text{Multinomial}(\cdot \mid \beta_i^k)$;

2. Draw the allele $x_{i,n_e} \mid c_{i,n_e} = l, z_{i,n_e} = k \sim P_m(\cdot \mid \mu_{i,l}^k, \delta_{i,l}^k)$,

where $P_m(\cdot)$ is a mutation model that can be flexibly defined based on whether the genetic markers
are microsatellites or single nucleotide polymorphisms.

For simplicity of presentation, in the model described above, we assume that the set of founder
alleles (but not their frequencies) at a particular locus is the same for all ancestral populations (i.e.,
$\mu_i^k = \mu_i$). We shall also assume that the mutation parameters for each population at any locus are
independent of the alleles at that locus (i.e., $\delta_{i,l}^k = \delta_i^k$). Also, our model assumes
Hardy-Weinberg equilibrium within populations and that loci are not linked to each other.

Figure 19.4 shows the graphical model representation of mStruct. Comparing it to Figure 19.2, 
we can see that mStruct includes an extra step in the generative process that allows for mutations. 

Microsatellite Mutation Model 

Microsatellites are a class of tandem-repeat loci that involve a DNA unit that is 1-4 base pairs in
length. Microsatellite DNA has significantly higher mutation rates as compared to other DNA, with
mutation rates as high as $10^{-3}$ or $10^{-4}$ (Kelly et al., 1991; Henderson and Petes, 1992). The large
amount of variation present in microsatellite DNA makes it ideal for differentiating founder patterns
between closely related populations. Microsatellite loci have been used before in DNA fingerprinting
(Queller et al., 1993), linkage analysis (Dietrich et al., 1992), and in the reconstruction
of human phylogeny (Bowcock et al., 1994). By applying theoretical models of microsatellite evolution
to data, one can attempt to address questions such as the time of divergence of two populations
(Pisani et al., 2004; Zhivotovsky et al., 2004).

The choice of a suitable microsatellite mutation model is important, for both computational
and interpretation purposes. Below we discuss the mutation model that we use and the biological
interpretation of the parameters of the mutation model. We begin with a stepwise mutation model
for microsatellites widely used in forensic analysis (Valdes et al., 1993; Lin et al., 2006).

FIGURE 19.4

Graphical model representation of mStruct. For convenience, we have ignored the diploid nature of
the observation. The shaded node indicates the variables we observe.

This model defines a conditional distribution of a progeny allele $b$ given its progenitor allele $a$,
both of which take continuous values:

$$p(b \mid a) = \frac{1}{2}\, \xi (1 - \delta)\, \delta^{|b - a| - 1}, \qquad (19.3)$$

where $\xi$ is the mutation rate (probability of any mutation), and $\delta$ is the factor by which
mutation decreases as the distance between the two alleles increases. Although this mutation distribution
is not stationary (i.e., it does not ensure that allele frequencies stay constant over the generations), it is
commonly used in forensic inference due to its simplicity. To some degree $\delta$ can be regarded as a
parameter that controls the probability of unit-distance mutation, as can be seen from the following
identity (for $b > a$): $p(b + 1 \mid a)/p(b \mid a) = \delta$.

In practice, the alleles for almost all microsatellites are represented by discrete counts. The
two-parameter stepwise mutation model described above complicates the inference procedure. We
propose a discrete microsatellite mutation model that is a simplification of Equation (19.3), but
captures its main idea. We posit that $P(b \mid a) \propto \delta^{|b - a|}$. Since $b \in [1, \infty)$, the
normalization constant of this distribution is:

$$\sum_{b=1}^{\infty} \delta^{|b - a|} = \sum_{b=1}^{a} \delta^{a - b} + \sum_{b=a+1}^{\infty} \delta^{b - a}
= \frac{1 - \delta^{a}}{1 - \delta} + \frac{\delta}{1 - \delta}
= \frac{1 + \delta - \delta^{a}}{1 - \delta},$$

which gives the mutation model as

$$P(b \mid a) = \frac{1 - \delta}{1 + \delta - \delta^{a}}\, \delta^{|b - a|}. \qquad (19.4)$$

We can interpret $\delta$ as a variance parameter, the factor by which the probability drops as a function of
the distance between the allele $a$ and its mutated version $b$. Figure 19.5 shows the discrete pdf for
various values of $\delta$.

FIGURE 19.5

Discrete pdf for two values of mutation parameter. 
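
The discrete mutation model of Equation (19.4) is simple enough to check numerically. The short Python sketch below is an illustration only (the function name is hypothetical); it evaluates the probability of observing repeat count $b$ given founder repeat count $a$, using the closed-form normalization constant derived above.

```python
def mutation_pmf(b, a, delta):
    """P(b | a) from Equation (19.4) for integer repeat counts a, b >= 1 and 0 <= delta < 1."""
    if a < 1 or b < 1:
        raise ValueError("repeat counts must be positive integers")
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    # Closed-form normalization constant: (1 + delta - delta^a) / (1 - delta)
    norm = (1.0 + delta - delta ** a) / (1.0 - delta)
    return delta ** abs(b - a) / norm
```

Summing `mutation_pmf(b, a, delta)` over a long range of `b` should give a value very close to 1, and the ratio of probabilities at distances 2 and 1 above the founder equals `delta`, which is the geometric decay that the variance interpretation refers to.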


Determination of Founder Set at Each Locus 

According to our model assumptions, there can be a different number of founder alleles at each 
locus. This number is typically smaller than the number of alleles observed at each marker since 
the founder alleles are “ancestral.” To estimate the appropriate number and allele states of founders, 
we fit finite mixtures (of fixed size, corresponding to the desired number of ancestral alleles) of 
microsatellite mutation models over all the measurements at a particular marker for all individuals. 
We use the Bayesian information criterion (BIC) (Schwarz, 1978) to determine the best number 
and states of founder alleles to use at each locus, since information criteria tend to favor smaller 
numbers of founder alleles which fit the observed data well. 

For each locus, we fit many different finite-sized mixtures of mutation distributions, with the size 
varying from 1 to the number of observed alleles at the locus. For each mixture size, the likelihood 
is optimized and a BIC value is computed. The number of founder alleles is chosen to be the size 
of the mixture that has the best (minimum) BIC value. We can do this as a pre-processing step 
before the actual inference or estimation procedures. This is possible since we assumed that the set 
of founder alleles at each locus was the same for all populations. 
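
The selection step can be sketched as follows. This is a hedged illustration, not the authors' pre-processing code: it assumes that the maximized log-likelihood of each mixture size has already been computed by some fitting routine, and the default parameter count of $2m$ for a mixture of size $m$ ($m$ founder states, $m - 1$ free weights, and one mutation parameter) is an assumption introduced here.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion (Schwarz, 1978); smaller values are better."""
    return -2.0 * log_likelihood + num_params * math.log(num_obs)


def choose_num_founders(loglik_by_size, num_obs, params_per_size=lambda m: 2 * m):
    """Pick the mixture size with minimum BIC.

    loglik_by_size: dict mapping mixture size m to its maximized log-likelihood.
    params_per_size: ASSUMED parameter count for a mixture of size m.
    Returns (best_size, dict of BIC scores).
    """
    scores = {
        m: bic(ll, params_per_size(m), num_obs)
        for m, ll in loglik_by_size.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

For example, with hypothetical log-likelihoods of -520, -480, and -478 for one, two, and three founders on 200 observations, the small gain from the third founder does not offset its penalty, so two founders would be selected.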

19.4.3 Result on HGDP Data 

Analyzing the HGDP data described earlier allows us to gain more insights about the structure of
human populations. Figure 19.6 shows the results of population structure analysis of the HGDP data
using mStruct for $K = 4$ (chosen using BIC) compared with results using Structure for the same
value of $K$.

From the figure, we can see that while both methods achieve similar clusters of individuals by
continent, the mStruct ancestral proportions indicate a significant level of similarity even between
individuals from different continents.

FIGURE 19.6

Population structure using Structure and mStruct for K = 4.

Analysis of Mutations

mStruct also models the mutations from alleles in ancestral populations to the observed alleles
in modern populations. This information can be used to reconstruct the estimated mutation
accumulated within an individual. Using the latitude and longitude labels associated with the
locations for each individual, we can construct a function that maps the geographical coordinates
(latitude and longitude) to an estimate of the accumulated mutation at that location. Figure 19.7
shows the contours of this function overlaid on the world map.



FIGURE 19.7 

Contours of accumulated mutations in modern populations using mStruct for K = 4. Darker (more 
red) colors indicate higher accumulated mutations and lighter (more blue) colors indicate lower 
accumulated mutations. 

The accumulated mutation at a location, which is plotted in Figure 19.7, depends on two factors
that act simultaneously: the mutation rate per base per generation and the number of
generations between the ancestral and modern populations at a particular location. If we assume
that mutation rates do not vary significantly by geography, then the accumulated mutation at a
location is a proxy for the number of generations between the ancestral and modern populations
at a location, i.e., how long ago the location was first inhabited. In this respect, Figure 19.7 agrees
with the “Out of Africa” models that are commonly agreed to explain human migrations across the
world (Hammer et al., 1998).


19.5 The blockmStruct Model 

The MAA model for populations described earlier is effective in modeling the mutations in mi- 
crosatellite marker lengths. Experiments with SNP data show that such a model is inadequate to 
model mutations in SNPs, which have only two allelic states. Due to its higher density across the 
genome, SNP data is also often found to show much larger linkage among adjacent loci than mi- 
crosatellites. 

We modify the “population-specific mixture of ancestral alleles” representation to propose a
“population-specific mixture of ancestral haplotype blocks” (MAH, or mixture of ancestral haplotypes).
For a locus $i$ and ancestral population $k$, we assume that there are three components: (1) a
set of ancestral (or founder) haplotype blocks $\mu_i^k = \{\mu_{i,1}^k, \dots, \mu_{i,L'_i}^k\}$, which can
differ from their descendant haplotype blocks in the modern population; (2) a mutation parameter
$\delta_i^k$ associated with the locus, which can be further generalized to be allele-specific if necessary;
and (3) an AP $\beta_i^k$, which now represents the frequencies of the ancestral haplotype blocks. Here
$L'_i$ denotes the total number of ancestral haplotype blocks present at locus $i$, which is different from
$L_i$ in the previous section, which denotes the total number of observed haplotype blocks at locus $i$.
By explicitly associating a mutation model with an ancestral population, we can now capture mutation
events. It is important to note that the mutation parameter $\delta$ is not the mutation rate commonly
referred to in the literature. As we shall see later, it is a measure of the variability of a locus that can be
described approximately as the combined effect of the per-generation mutation rate and the age of
the population.


19.5.1 Representation of Modern Genomes 

The blockmStruct model assumes that each chromosome of an individual’s genome is composed
of $J$ unlinked haplotype blocks. The genome of the $n$th individual is therefore given by the set
$y_n = \{y_{1,n_0}, y_{1,n_1}, \dots, y_{J,n_0}, y_{J,n_1}\}$. The length of the $j$th haplotype block, given
by $l_j$, is the number of SNPs included in the $j$th haplotype block, leading to the identity
$\sum_{j=1}^{J} l_j = I$. The $j$th haplotype block for the $n$th individual is
$y_{j,n_e} = \{x_{j_1,n_e}, \dots, x_{j_1 + l_j - 1,n_e}\}$, where $j_1, j_1 + l_j - 1 \in \{1, \dots, I\}$
are indices denoting the left-most and right-most boundaries of the $j$th block,
respectively. Figure 19.8 shows the blockmStruct representation of haplotype blocks.

FIGURE 19.8

Representation of haplotype blocks in blockmStruct. For notational convenience, we drop the $e$
subscript denoting ploidy and the $n$ subscript denoting an individual in the diagram. Haplotype
block $y_j$ is composed of $l_j$ SNPs $x_{j_1}$ to $x_{j_1 + l_j - 1}$. In this example, $y_j = 10101000$
and $l_j = 8$.

The lengths and boundaries of the haplotype blocks are assumed to be the same for all individuals
and can be chosen according to different strategies. In our experiments, we assume that all
haplotype blocks have fixed length $b$. The following discusses the possible strategies for choosing
haplotype blocks and their advantages and disadvantages.

Strategies for Choosing Haplotype Blocks 

Haplotype blocks can be chosen according to various criteria. A few commonly used criteria are: 

• Fixed number of SNPs per block. This is the simplest strategy for choosing haplotype blocks 
and is the most efficient to compute. 

• Length in KB or MB of the haplotype block. This requires knowledge of the positions of the 
SNPs in the genome and can produce blocks of variable length. A useful heuristic is to use the 
knowledge of the range of linkage disequilibrium to pick a single block length. 

• Choosing boundaries where the linkage disequilibrium (or correlation) between adjacent SNPs
drops below a pre-specified threshold. This allows us to create blocks consistent with the earlier
assumption of unlinked haplotype blocks.

• By using a haplotype inference program. A number of programs are available for inferring 
haplotype blocks, and using one of them is likely to produce the most accurate haplotype blocks. 
However, this inference is often too computationally expensive to be efficient. 
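
Two of the strategies above, a fixed number of SNPs per block and boundaries placed where correlation between adjacent SNPs drops, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the procedure used in our experiments: haplotypes are taken to be lists of 0/1 values, linkage disequilibrium is approximated by the squared Pearson correlation between adjacent SNP columns, and all function names are hypothetical.

```python
def fixed_size_blocks(num_snps, block_len):
    """Return (start, end) index pairs with end exclusive; the last block may be shorter."""
    return [(s, min(s + block_len, num_snps)) for s in range(0, num_snps, block_len)]


def r_squared(col_a, col_b):
    """Squared Pearson correlation between two 0/1 SNP columns (0.0 if either is constant)."""
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(col_a, col_b)) / n
    var_a = sum((x - mean_a) ** 2 for x in col_a) / n
    var_b = sum((y - mean_b) ** 2 for y in col_b) / n
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov * cov / (var_a * var_b)


def ld_threshold_blocks(haplotypes, threshold):
    """Start a new block wherever r^2 between adjacent SNPs falls below the threshold."""
    num_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(num_snps)]
    blocks, start = [], 0
    for j in range(1, num_snps):
        if r_squared(columns[j - 1], columns[j]) < threshold:
            blocks.append((start, j))
            start = j
    blocks.append((start, num_snps))
    return blocks
```

The fixed-size rule needs no genotype data at all, which is why it is the cheapest option, whereas the threshold rule needs one pass over adjacent SNP pairs and yields blocks of variable length.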


19.5.2 The Generative Process 

We propose to represent each ancestral population by a set of population-specific MAHs. This
results in a generative process similar to the one defined for mStruct:

• For each individual $n$, draw the mixed membership vector (or ancestry vector): $\lambda_n \sim P(\cdot \mid \alpha)$,
where $P(\cdot \mid \alpha)$ is a pre-chosen structure prior.

• For each marker allele $x_{i,n_e} \in x_n$:

- 2.1: Draw the latent ancestral-population-origin indicator $z_{i,n_e} \sim \text{Multinomial}(\cdot \mid \lambda_n)$;

- 2.2a: Draw the latent founder indicator $c_{i,n_e} \mid z_{i,n_e} = k \sim \text{Multinomial}(\cdot \mid \beta_i^k)$;

- 2.2b: Draw the haplotype block $y_{i,n_e} \mid c_{i,n_e} = l', z_{i,n_e} = k \sim P_m(\cdot \mid \mu_{i,l'}^k, \delta_{i,l'}^k)$,

where $P_m(\cdot)$ is a mutation model that we will define below to capture mutations within haplotype
blocks.
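
Steps 2.1 to 2.2b can be simulated for one chromosome copy with the Python sketch below. It is illustrative only and not the blockmStruct implementation: it assumes that founder blocks are shared across populations, that the mutation model flips each SNP independently with probability $\delta$, and that the ancestry vector has already been drawn; the function names and data layout are hypothetical.

```python
import random


def mutate_block(founder, delta, rng):
    """Flip each 0/1 SNP of the founder block independently with probability delta."""
    return tuple(1 - s if rng.random() < delta else s for s in founder)


def sample_block_genome(lam, founders, beta, delta, rng):
    """Simulate one chromosome copy as a list of haplotype blocks.

    lam:         ancestry vector of length K (already drawn from the structure prior).
    founders[j]: list of founder haplotype blocks at block j (ASSUMED shared by populations).
    beta[k][j]:  founder frequencies of population k at block j.
    delta[k][j]: mutation parameter of population k at block j.
    """
    num_pops = len(lam)
    genome = []
    for j, block_founders in enumerate(founders):
        k = rng.choices(range(num_pops), weights=lam)[0]  # step 2.1
        l = rng.choices(range(len(block_founders)), weights=beta[k][j])[0]  # step 2.2a
        genome.append(mutate_block(block_founders[l], delta[k][j], rng))  # step 2.2b
    return genome
```

Setting every mutation parameter to zero makes each sampled block an exact copy of a founder block, which is a convenient sanity check when generating simulated data.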

Figure 19.9 shows the graphical model representation of blockmStruct. We can see by comparing 
it to Figure 19.4 that blockmStruct differs from mStruct in its representation of modern individual 
genomes. This difference also affects the ancestral alleles inferred by the methods. mStruct and 
blockmStruct also differ in the mutation models they use. 

Mutation Model 

We assume a mutation model for the haplotype in which, within a haplotype block, mutations
occur independently of each other with a fixed probability. Let $d(y, \mu)$ be the number of
SNPs which are different in the two haplotypes $y$ and $\mu$ (also referred to as the Manhattan distance
between the two haplotype blocks, which for 0/1 SNP blocks equals the Hamming distance). The parameter
$\delta \in [0, 1]$ is the probability of a single mutation in the haplotype block. If the size of both
blocks $y$ and $\mu$ is assumed to be $b$, it is easy to show that

$$P(y \mid \mu, \delta) = \delta^{d(y,\mu)} (1 - \delta)^{b - d(y,\mu)}. \qquad (19.5)$$

FIGURE 19.9

Graphical model representation of blockmStruct. For convenience, we have ignored the diploid nature
of the observation. The shaded node indicates the variables we observe.

FIGURE 19.10

A demonstration of how using different mutation models can lead to different inferences of ancestry. 
The dark shaded squares indicate sites of mutation. On the left, the shaded columns indicate use of 
single SNP mutation models. On the right, the shaded rows indicate use of mutation models over 
SNP blocks. 


This model is an intuitive representation of the possibility of SNPs switching allelic state. The 
assumption of independence among mutations within a haplotype block is similar to a mutation 
model for single SNPs. 
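
Equation (19.5) can be evaluated and checked with a short Python sketch. The code below is illustrative and the function names are hypothetical; the helper that sums over all $2^b$ possible blocks is included only to confirm numerically that the model defines a proper distribution over haplotype blocks of length $b$.

```python
from itertools import product


def hamming(y, mu):
    """Number of SNP positions at which two equal-length 0/1 blocks differ."""
    if len(y) != len(mu):
        raise ValueError("blocks must have the same length")
    return sum(1 for s, t in zip(y, mu) if s != t)


def block_mutation_prob(y, mu, delta):
    """Equation (19.5): delta^d * (1 - delta)^(b - d), with d the distance and b the block size."""
    d = hamming(y, mu)
    b = len(mu)
    return delta ** d * (1.0 - delta) ** (b - d)


def total_probability(mu, delta):
    """Sum of Equation (19.5) over every possible block y of the same length as mu."""
    return sum(block_mutation_prob(y, mu, delta) for y in product((0, 1), repeat=len(mu)))
```

Because the probability depends on the observed block only through its distance to the founder, all blocks at the same distance are equally likely, and for $\delta < 0.5$ the founder itself is the single most probable descendant.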

Figure 19.10 uses toy data to demonstrate the effects of different mutation models on ancestry 
inference. In a mutation model over single SNPs, mutations are inferred to have occurred at a single 
location. In the haplotype mutation model, inference leads to two potential sites of mutation. Thus, 
the additional constraints imposed by the existence of ancestral haplotype blocks can lead to the 
model capturing more mutational information. 

Another advantage of the block mutation model combined with the MAH representation of 
populations is that it implicitly models linkage between adjacent loci. By constraining the ancestral 
alleles to be haplotype blocks and not individual SNPs, we can account for linkage between alleles 
at physically adjacent loci.

19.5.3 Inference and Parameter Estimation

For notational convenience, we will ignore the diploid nature of observations in the analysis that
follows. With the understanding that the analysis is carried out for an arbitrary $n$th individual, we
will drop the subscript $n$. We overload the indicator arrays $z_i$ and $c_i$ to also use them as scalar index
variables, that is, scalars with a value equal to the index at which the array forms have 1s. In
other words: $z_i \in \{1, \dots, K\}$ or $z_i = [z_{i,1}, \dots, z_{i,K}]$, where $z_{i,k} = \mathbb{I}[z_i = k]$,
and $\mathbb{I}[\cdot]$ denotes an indicator function equal to 1 when the predicate argument is true and 0
otherwise. A similar overloading is also assumed for the $c_i$ variables. We use
$f(y_i \mid \mu_{i,l}, \delta_i^k)$ to denote our mutation model for haplotypes.

The joint probability distribution of the data and the relevant variables under the blockmStruct
model can then be written as:

$$P(y, z, c, \lambda \mid \alpha, \beta, \mu, \delta) = P(\lambda \mid \alpha) \prod_{i=1}^{I} P(z_i \mid \lambda)\, P(c_i \mid z_i, \beta_i^{1:K})\, P(y_i \mid c_i, z_i, \mu_i, \delta_i^{1:K}).$$


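The marginal likelihood of the data can be computed by summing/integrating out the latent
variables:

$$P(y \mid \alpha, \beta, \mu, \delta) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \left( \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1} \right) \times \prod_{i=1}^{I} \sum_{k=1}^{K} \sum_{l=1}^{L'_i} \lambda_k\, \beta_{i,l}^k\, f(y_i \mid \mu_{i,l}, \delta_i^k)\, d\lambda.$$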

However, a closed-form solution to this summation/integration is not possible, and indeed exact
inference on hidden variables such as the mixed membership vector $\lambda$ and estimation of model
parameters such as the mutation parameters $\delta$ under blockmStruct is intractable. We use a variational
inference algorithm as described in Shringarpure and Xing (2009).



Variational Inference 

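We use a mean field approximation for performing inference on the model. This approximation
method estimates an intractable joint posterior $p(\cdot)$ of all the hidden variables in the model by a
product of marginal distributions $q(\cdot) = \prod_i q_i(\cdot)$, each over only a single hidden variable. The
optimal parameterization of $q_i(\cdot)$ for each variable is obtained by minimizing the Kullback-Leibler
divergence between the variational approximation $q$ and the true joint posterior $p$. Using results
from the generalized mean field theory (Xing et al., 2003), we can write the variational distributions
of the latent variables in blockmStruct as follows:

$$q(\lambda) \propto \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1 + \sum_{i=1}^{I} \langle z_{i,k} \rangle},$$

$$q(c_i) \propto \prod_{l=1}^{L'_i} \left( \prod_{k=1}^{K} \left( \beta_{i,l}^k\, f(y_i \mid \mu_{i,l}, \delta_i^k) \right)^{\langle z_{i,k} \rangle} \right)^{c_{i,l}},$$

$$q(z_i) \propto \prod_{k=1}^{K} \left( e^{\langle \log \lambda_k \rangle} \prod_{l=1}^{L'_i} \left( \beta_{i,l}^k\, f(y_i \mid \mu_{i,l}, \delta_i^k) \right)^{\langle c_{i,l} \rangle} \right)^{z_{i,k}}.$$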



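In the distributions above, the angle brackets $\langle \cdot \rangle$ are used to indicate the expected
values of the enclosed random variables. A close inspection of the above formulas reveals that these
variational distributions have the form $q(\lambda) \sim \text{Dirichlet}(\gamma_1, \ldots, \gamma_K)$,
$q(z_i) \sim \text{Multinomial}(\rho_{i,1}, \ldots, \rho_{i,K})$, and
$q(c_i) \sim \text{Multinomial}(\xi_{i,1}, \ldots, \xi_{i,L})$, respectively, of which the parameters
$\gamma_k$, $\rho_{i,k}$, and $\xi_{i,l}$ are given by the following equations:

\[
\begin{aligned}
\gamma_k &= \alpha_k + \sum_{i=1}^{I} \langle z_{i,k} \rangle, \\
\rho_{i,k} &= \frac{e^{\langle \log \lambda_k \rangle} \prod_{l=1}^{L}
  \left[ \beta^{i}_{k,l}\, P(y_i \mid c_i = l, z_i = k, \mu, \delta) \right]^{\langle c_{i,l} \rangle}}
  {\sum_{k'=1}^{K} e^{\langle \log \lambda_{k'} \rangle} \prod_{l=1}^{L}
  \left[ \beta^{i}_{k',l}\, P(y_i \mid c_i = l, z_i = k', \mu, \delta) \right]^{\langle c_{i,l} \rangle}}, \\
\xi_{i,l} &= \frac{\prod_{k=1}^{K}
  \left[ \beta^{i}_{k,l}\, P(y_i \mid c_i = l, z_i = k, \mu, \delta) \right]^{\langle z_{i,k} \rangle}}
  {\sum_{l'=1}^{L} \prod_{k=1}^{K}
  \left[ \beta^{i}_{k,l'}\, P(y_i \mid c_i = l', z_i = k, \mu, \delta) \right]^{\langle z_{i,k} \rangle}},
\end{aligned}
\]

and they have the properties: $\langle \log \lambda_k \rangle = \psi(\gamma_k) - \psi(\sum_k \gamma_k)$,
$\langle z_{i,k} \rangle = \rho_{i,k}$, and $\langle c_{i,l} \rangle = \xi_{i,l}$,
which suggest that they can be computed via fixed point iterations. (The digamma function $\psi(\cdot)$
used above is the first derivative of the logarithm of the gamma function $\Gamma(\cdot)$.) It can be shown
that this iteration will converge to a local optimum, similar to what happens in an EM algorithm.
Empirically, a near-global optimum can be obtained by multiple random restarts of the fixed point
iteration. Upon convergence, we can easily compute an estimate of the ancestry vector $\lambda$ for each
individual from $q(\lambda)$.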
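
The fixed point iteration described above can be sketched in a few lines of Python. This is only an
illustrative sketch, not the authors' implementation. It assumes a precomputed table `log_w[i][k][l]`
holding the log of the product of the ancestral frequency and the mutation-model likelihood for locus
i, population k, and ancestral haplotype l (a hypothetical input), it uses a simple series
approximation of the digamma function, and it runs a fixed number of sweeps instead of testing
convergence of the lower bound.

```python
import math

def digamma(x):
    """Approximate the digamma function psi(x) for x > 0."""
    shift = 0.0
    while x < 6.0:                      # recurrence psi(x) = psi(x + 1) - 1/x
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)                # asymptotic series, accurate for x >= 6
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series

def normalize_log(log_values):
    """Exponentiate and normalize, subtracting the max for stability."""
    top = max(log_values)
    weights = [math.exp(v - top) for v in log_values]
    total = sum(weights)
    return [w / total for w in weights]

def mean_field(alpha, log_w, sweeps=50):
    """Run fixed point sweeps; log_w[i][k][l] is a hypothetical input table."""
    n_loci, K, L = len(log_w), len(alpha), len(log_w[0][0])
    rho = [[1.0 / K] * K for _ in range(n_loci)]   # expected z_{i,k}
    xi = [[1.0 / L] * L for _ in range(n_loci)]    # expected c_{i,l}
    gamma = list(alpha)
    for _ in range(sweeps):
        # Dirichlet parameters of q(lambda)
        gamma = [alpha[k] + sum(r[k] for r in rho) for k in range(K)]
        psi_total = digamma(sum(gamma))
        e_log_lam = [digamma(g) - psi_total for g in gamma]
        for i in range(n_loci):
            rho[i] = normalize_log([
                e_log_lam[k] + sum(xi[i][l] * log_w[i][k][l] for l in range(L))
                for k in range(K)])
            xi[i] = normalize_log([
                sum(rho[i][k] * log_w[i][k][l] for k in range(K))
                for l in range(L)])
    return gamma, rho, xi

# Toy input: 3 loci, K = 2 populations, L = 2 ancestral haplotypes.
toy_log_w = [[[-1.0, -2.0], [-3.0, -0.5]],
             [[-0.2, -2.5], [-1.5, -1.0]],
             [[-2.0, -0.3], [-0.7, -2.2]]]
gamma, rho, xi = mean_field([1.0, 1.0], toy_log_w)
```

An estimate of the ancestry vector then follows from the Dirichlet mean, `[g / sum(gamma) for g in gamma]`.
In practice one would restart from several random initializations of `rho` and `xi`, as the text suggests.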

Parameter Estimation 

The parameters of our model are the centroids $\mu$, the mutation parameters $\delta$, the ancestral
allele frequency distributions $\beta$, and the Dirichlet hyperparameter $\alpha$ that is the prior on
ancestral populations. For the hyperparameter estimation, we perform empirical Bayes estimation using
the variational expectation maximization (variational EM) algorithm described in Blei et al. (2003).
The variational inference described in Section 19.5.3 provides us with a tractable lower bound on the
log-likelihood as a function of the current values of the hyperparameters. We can thus maximize it
with respect to the hyperparameters. If we alternately carry out variational inference with fixed
hyperparameters, followed by a maximization of the lower bound with respect to the hyperparameters
for fixed values of the variational parameters, we can get an empirical Bayes estimate of the
hyperparameters. The derivation, details of which we will not show here, leads to the following
iterative algorithm:

1. (E-step) For each individual, find the optimizing values of the variational parameters
   ($\gamma_n, \rho_n, \xi_n$; $n \in 1, \ldots, N$) using the variational updates described above.

2. (M-step) Maximize the resulting variational lower bound on the likelihood with respect to the
   model parameters, namely $\alpha$, $\beta$, $\mu$, $\delta$.

The two steps are repeated until the lower bound on the log-likelihood converges. The estimation of 
hyperparameters for blockmStruct is identical to that of mStruct, with the alleles x replaced by the 
haplotype blocks y. We therefore refer the reader to Shringarpure and Xing (2009) for mathematical 
details of the parameter estimation.

FIGURE 19.11

Effect of varying mutation rate on accuracy of individual ancestry recovery. 


19.6 Experiments 

We first validate blockmStruct using simulation data to examine the accuracy of ancestry recovery 
with varying mutation rates. We then use the blockmStruct model to analyze data from the HGDP. 

19.6.1 Simulation Experiments 

We demonstrate the accuracy of the blockmStruct model through the task of recovering individual
ancestry using simulated data. We use the coalescent software ms (Hudson, 1990) with
recombination to generate data for our simulation. Most coalescent simulation software, including
ms, assumes an infinite-sites model of mutation, disallowing recurrent mutations. This would generate
simulation data that violate the assumptions of the blockmStruct model. We therefore use
the coalescent software only to generate genealogy trees at 500 loci. At each locus, we assume that the
unit of inheritance is a block of 5 SNPs. Mutations are placed on the branches of the genealogy trees
according to a Poisson distribution with mean proportional to the branch lengths and applied
to the blocks. We simulate a two-population admixture with 200 individuals in the resulting
population. The recombination probability between adjacent bases is set to $10^{-8}$ per generation and the
effective population size $N$ is set to $10^4$ to approximate parameter values for human populations.
We vary the mutation parameter $4N\mu$ over the values {0.4, 0.8, 1.6, 3.2}, corresponding
to parameter values for human populations. To examine the effect of modeling mutation,
we compare our results to Structure using the haplotype blocks as alleles. We compare the
performance of Structure and blockmStruct in terms of their error in recovering the ancestry proportions
$\lambda$.
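
The mutation-placement step can be illustrated with a short sketch. It is not the simulation code
used for these experiments. It assumes a genealogy is summarized as a plain list of branch lengths
and that `rate` is a hypothetical mutation rate per unit of branch length, so that the number of
mutations on a branch is Poisson with mean `rate * length`. The Poisson sampler uses Knuth's
multiplication method, which is adequate for the small means involved here.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def place_mutations(branch_lengths, rate, rng):
    """Number of mutations on each branch, with mean rate * branch length."""
    return [sample_poisson(rate * length, rng) for length in branch_lengths]

rng = random.Random(0)
branch_lengths = [0.5, 1.0, 2.0, 0.0]          # toy genealogy (assumed values)
counts = place_mutations(branch_lengths, rate=0.8, rng=rng)
```

Because recurrent mutations are allowed, several mutations may fall on the same block along a
lineage, which is exactly what an infinite-sites simulator would rule out.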

The results are shown in Figure 19.11. We find that when the mutation rate $4N\mu$ is low, Structure
performs as well as, or even better than, blockmStruct. As the mutation rate rises, the error in
ancestry recovery rises for both methods. For Structure, this is expected since it has no model for
mutations. For blockmStruct, this effect is a result of the mismatch between the mutation model used
to simulate data and the mutation model assumed by blockmStruct. However, since blockmStruct has
a mutation model, its ancestry recovery error increases much more slowly than that of Structure. This
demonstrates the utility of modeling mutations in haplotype blocks.


19.6.2 Analysis of HGDP+HapMap Data 

We analyzed a dataset containing high-density SNP genotyping data for 597 individuals from the 
HGDP and the HapMap project. For computational reasons, we used 10,000 SNPs on chromosome 
1 for our ancestry analysis. The number of ancestral populations $K$ was varied from 2 to 6. We used
SNP blocks of size 15, so that the correlation between the left (or right) endpoints of two consecutive 
windows was less than 0.25 in 90% of the windows. 




[Figure 19.12 panels: K = 2, K = 3, K = 4, K = 5, K = 6]


FIGURE 19.12 

Ancestry vectors for 597 individuals from the Human Genome Diversity Project with K ranging 
from 2 to 6, inferred using blockmStruct. 

From Figure 19.12, we can see that the ancestry vectors produced by blockmStruct cluster indi- 
viduals by their continental divisions. For K = 2, the individuals are divided into an African and a 
non-African cluster. For K = 3, the non- African cluster separates into a cluster containing Europe, 
the Middle East, and Central-South Asia, and a second cluster containing the Americas, East Asia, 
and Oceania. At K = 4, the Oceanian populations separate into a cluster of their own. At K = 5, 
the population component corresponding to the Americas and East Asia splits into two components. 
One of the two components corresponds to American populations. The other component appears to 
varying degrees in the Asian, European, and Middle Eastern populations. At K = 6, the ancestral 
population component for the African population splits into two components which display vary- 
ing degrees of admixture for different African groups. For larger values of K, the new population 
components add little interpretative value. 

The population structure uncovered by blockmStruct shows significant similarities to, and differences
from, the population structure inferred by Structure and mStruct on the same data. Population
structure analysis of the HGDP data by Structure (Figure 19.3) shows that ancestral populations
correspond to geographical divisions, similar to the inference by blockmStruct for small values of
K. For larger values of K, Structure infers population clusters that correspond completely to one
(or more) regional groups. blockmStruct, on the other hand, infers population components for higher
values of K that contribute partially to multiple regional populations. This behavior is similar to the
results for mStruct seen in Figure 19.6. However, the ancestry proportions inferred by blockmStruct
do not exhibit the same degree of ancestry sharing as seen in Figure 19.6.

We find that for all values of K, the model infers mutation parameters that are significantly 
larger than zero. However, in all cases, the method fails to uncover meaningful mutational structure. 
We discuss this behavior further in Section 19.7.


19.7 Discussion 

The Structure model by Pritchard et al. (2000) models admixing of ancestral populations but does 
not model allele mutations. The mStruct model by Shringarpure and Xing (2009) extends Structure 
by modeling allele mutations for microsatellite markers and demonstrates that modeling allele mu- 
tations affects ancestry inference and produces more accurate ancestry estimates when mutations 
are present. However, mStruct fails to account for mutations in SNP data and produces results identical
to those of Structure on SNP data. We have developed blockmStruct, a model for performing
ancestry inference on dense SNP data while modeling allele mutations in haplotype blocks. 

We validated our model using simulated genotype data to conclude that the method can recover 
ancestry more accurately than previous methods even with high mutation rates in SNPs. Our analy- 
sis of the HGDP+HapMap data indicates that the mutation model in blockmStruct affects the results 
of ancestry inference and produces population structure that shares similarities with both Structure
and mStruct. As in the Structure analysis, the population structure corresponds broadly to continen- 
tal geographic divisions. Like the mStruct analysis, the inferred ancestral population components 
show partial membership in multiple modern populations, producing a higher degree of population 
sharing. However, the results also show important differences compared to results from analyses 
of the same data using Structure and mStruct. In Structure, most individuals are assigned mem- 
bership almost completely to a single ancestral population while the blockmStruct analysis assigns 
admixed ancestry to a number of individuals. The degree of admixture assigned to individuals by 
blockmStruct is not as high as that inferred by mStruct. This suggests that the choice of representing 
modern individuals as haplotype blocks allows us to capture relationships between individuals more 
accurately using SNP data than either Structure or mStruct permit. 

Another advantage of the haplotype block representation is that it offers us a way of model- 
ing linkage between adjacent loci. Structure and mStruct both assume no linkage between loci. 
While both methods are robust to some degree of linkage disequilibrium, such an assumption is not 
appropriate for high-density SNP data. As demonstrated earlier, a haplotype block model indirectly
accounts for linkage disequilibrium through the ancestral haplotype blocks. This is an alternative to
the explicit modeling of linkage using hidden Markov models, for instance in Falush et al. (2003),
which requires more complex inference procedures. blockmStruct offers a computationally efficient 
way of modeling linkage and mutations simultaneously in population structure. In our analyses, we 
assumed a fixed-length model for haplotype blocks by examining the range of linkage in the ob- 
served SNPs. More accurate modeling of haplotype blocks, using variable-length blocks based on 
linkage decay or a haplotype inference method, can enable more accurate modeling of linkage at 
the cost of increased computation. A computationally efficient choice of haplotype blocks that can 
account for linkage more accurately remains a question for further study.

The ability to analyze accumulated mutation is an important advantage that mStruct offers over
Structure when analyzing microsatellite data. With SNP data, mStruct fails to capture any mutational
information: most mutation rates are inferred to be close to zero, so mStruct and Structure
produce identical population structure with SNP data. The mutation model
of blockmStruct based on haplotype blocks is an extension of the single-SNP mutation model of 
mStruct. However, even though it infers non-zero values for most mutation parameters, it fails to 
recover any meaningful spatial structure in the accumulated mutation. This is likely to be a result 
of the mismatch between the assumed mutation model and the true mutation model over haplotype 
blocks. Alternative models of haplotype mutation may improve the recovery of mutational information
in blockmStruct.


In this chapter we discuss some of the consequences of the mixed membership perspective on time
series analysis. In its most abstract form, a mixed membership model aims to associate an individual
entity with some set of attributes based on a collection of observed data. For example, a person
(entity) can be associated with various defining characteristics (attributes) based on observed pairwise
interactions with other people (data). Likewise, one can describe a document (entity) as comprised 
of a set of topics (attributes) based on the observed words in the document (data). Although much 
of the literature on mixed membership models considers the setting in which exchangeable collec- 
tions of data are associated with each member of a set of entities, it is equally natural to consider 
problems in which an entire time series is viewed as an entity and the goal is to characterize the time 
series in terms of a set of underlying dynamic attributes or dynamic regimes. Indeed, this perspective 
is already present in the classical hidden Markov model (Rabiner, 1989) and switching state-space 
model (Kim, 1994), where the dynamic regimes are referred to as “states,” and the collection of 
states realized in a sample path of the underlying process can be viewed as a mixed membership 
characterization of the observed time series. Our goal here is to review some of the richer modeling
possibilities for time series that are provided by recent developments in the mixed membership
framework.

Much of our discussion centers around the fact that while in classical time series analysis it is 
commonplace to focus on a single time series, in mixed membership modeling it is rare to focus 
on a single entity (e.g., a single document); rather, the goal is to model the way in which multiple 
entities are related according to the overlap in their pattern of mixed membership. Thus we take a 
nontraditional perspective on time series in which the focus is on collections of time series. Each 
individual time series may be characterized as proceeding through a sequence of states, and the 
focus is on relationships in the choice of states among the different time series. 

As an example that we review later in this chapter, consider a multivariate time series that arises 
when position and velocity sensors are placed on the limbs and joints of a person who is going 
through an exercise routine. In the specific dataset that we discuss, the time series can be segmented 
into types of exercise (e.g., jumping jacks, touch-the-toes, and twists). Each person may select a sub- 
set from a library of possible exercise types for their individual routine. The goal is to discover these 
exercise types (i.e., the “behaviors” or “dynamic regimes”) and to identify which person engages in 
which behavior, and when. Discovering and characterizing “jumping jacks” in one person’s routine 
should be useful in identifying that behavior in another person’s routine. In essence, we would like 
to implement a combinatorial form of shrinkage involving subsets of behaviors selected from an 
overall library of behaviors. 

Another example arises in genetics, where mixed membership models are referred to as “ad- 
mixture models” (Pritchard et al., 2000). Here the goal is to model each individual genome as a 
mosaic of marker frequencies associated with different ancestral genomes. If we wish to capture 
the dependence of nearby markers along the genome, then the overall problem is that of capturing 
relationships among the selection of ancestral states along a collection of one-dimensional spatial 
series. 

One approach to problems of this kind involves a relatively straightforward adaptation of hidden 
Markov models or other switching state-space models into a Bayesian hierarchical model: transi- 
tion and emission (or state-space) parameters are chosen from a global prior distribution and each 
individual time series either uses these global parameters directly or perturbs them further. This ap- 
proach in essence involves using a single global library of states, with individual time series differ- 
ing according to their particular random sequence of states. This approach is akin to the traditional 
Dirichlet-multinomial framework that is used in many mixed membership models. An alternative is 
to make use of a beta-Bernoulli framework in which each individual time series is modeled by first 
selecting a subset of states from a global library and then drawing state sequences from a model de- 
fined on that particular subset of states. We will overview both of these approaches in the remainder 
of the chapter. 

While much of our discussion is agnostic to the distinction between parametric and nonparamet- 
ric models, our overall focus is on the nonparametric case. This is because the model choice issues 
that arise in the multiple time series setting can be daunting, and the nonparametric framework pro- 
vides at least some initial control over these issues. In particular, in a classical state-space setting we 
would need to select the number of states for each individual time series, and do so in a manner that 
captures partial overlap in the selected subsets of states among the time series. The nonparametric 
approach deals with these issues as part of the model specification rather than as a separate model 
choice procedure. 

The remainder of the chapter is organized as follows. In Section 20.1.1, we review a set of time 
series models that form the building blocks for our mixed membership models. The mixed member- 
ship analogy for time series models is aided by relating to a canonical mixed membership model: 
latent Dirichlet allocation (LDA), reviewed in Section 20.1.2. Bayesian nonparametric variants of 
LDA are outlined in Section 20.1.3. Building on this background, in Section 20.2 we turn our focus 
to mixed membership in time series. We first present Bayesian parametric and nonparametric models
for single time series in Section 20.2.1 and then for collections of time series in Section 20.2.3.
Section 20.3 contains a brief survey of related Bayesian and Bayesian nonparametric time series
models. 


20.1 Background 

In this section we provide a brief introduction to some basic terminology from time series analy- 
sis. We also overview some of the relevant background from mixed membership modeling, both 
parametric and nonparametric. 

20.1.1 State-Space Models 

The autoregressive (AR) process is a classical model for time series analysis that we will use as a 
building block. An AR model assumes that each observation is a function of some fixed number 
of previous observations plus an uncorrelated innovation. Specifically, a linear, time-invariant AR 
model has the following form: 


\[
y_t = \sum_{i=1}^{r} a_i y_{t-i} + \epsilon_t, \tag{20.1}
\]

where $y_t$ represents a sequence of equally spaced observations, $\epsilon_t$ the uncorrelated
innovations, $a_i$ the time-invariant autoregressive parameters, and $r$ the order of the model. Often
one assumes normally distributed innovations $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$, further implying
that the innovations are independent.
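
As a small illustration of Equation (20.1), the sketch below simulates a scalar AR process with
Gaussian innovations. The function name, the `init` argument (the r most recent past values, oldest
first), and the default of zero initial conditions are assumptions made for this example only.

```python
import random

def simulate_ar(coeffs, sigma, n_steps, rng, init=None):
    """Simulate y_t = sum_i a_i * y_{t-i} + eps_t with eps_t ~ N(0, sigma^2)."""
    r = len(coeffs)
    # Assumed initial conditions: zeros unless the caller supplies r past values.
    history = list(init) if init is not None else [0.0] * r
    for _ in range(n_steps):
        lagged = history[-1:-r - 1:-1]          # y_{t-1}, ..., y_{t-r}
        mean = sum(a * y for a, y in zip(coeffs, lagged))
        history.append(mean + rng.gauss(0.0, sigma))
    return history[r:]

# A noisy AR(2) sample path of length 100.
path = simulate_ar([0.6, -0.2], sigma=1.0, n_steps=100, rng=random.Random(0))
```

Setting `sigma=0.0` makes the recursion deterministic, which is a convenient way to check the lag
ordering by hand.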

A more general formulation is that of linear state-space models, sometimes referred to as dynamic
linear models. This formulation, which is closely related to autoregressive moving average
processes, assumes that there exists an underlying state vector $x_t \in \mathbb{R}^n$ such that the past
and future of the dynamical process $y_t \in \mathbb{R}^d$ are conditionally independent given $x_t$. A
linear time-invariant state-space model is given by

\[
x_t = A x_{t-1} + e_t, \qquad y_t = C x_t + w_t, \tag{20.2}
\]

where $e_t$ and $w_t$ are independent, zero-mean Gaussian noise processes with covariances $\Sigma$ and $R$,
respectively. Here, we assume a vector-valued process. One could likewise consider a vector-valued
AR process, as we do in Section 20.2.1.
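
A minimal simulation of Equation (20.2) is sketched below. To keep it dependency-free, matrices are
nested lists, and the noise covariances are assumed to be isotropic (`state_std` and `obs_std` times
the identity), which is a simplification for this example and not part of the model above.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def simulate_lds(A, C, state_std, obs_std, x0, n_steps, rng):
    """Simulate x_t = A x_{t-1} + e_t and y_t = C x_t + w_t."""
    x, observations = list(x0), []
    for _ in range(n_steps):
        # Assumed isotropic Gaussian noise in both the state and the observation.
        x = [m + rng.gauss(0.0, state_std) for m in matvec(A, x)]
        observations.append([m + rng.gauss(0.0, obs_std) for m in matvec(C, x)])
    return observations

# Position-velocity state observed through its first coordinate.
A = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
ys = simulate_lds(A, C, state_std=0.1, obs_std=0.5, x0=[0.0, 1.0],
                  n_steps=50, rng=random.Random(0))
```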

There are several ways to move beyond linear state-space models. One approach is to consider 
smooth nonlinear functions in place of the matrix multiplication in linear models. Another approach, 
which is our focus here, is to consider regime-switching models based on a latent sequence of 
discrete states $\{z_t\}$. In particular, we consider Markov switching processes, where the state sequence
is modeled as Markovian. If the entire state is a discrete random variable, and the observations $\{y_t\}$
are modeled as being conditionally independent given the discrete state, then we are in the realm of
hidden Markov models (HMMs) (Rabiner, 1989). Details of the HMM formulation are expounded
upon in Section 20.2.1.

It is also useful to consider hybrid models in which the state contains both discrete and con- 
tinuous components. We will discuss an important example of this formulation — the autoregressive 
HMM — in Section 20.2.1. Such models can be viewed as a collection of AR models, one for each 
discrete state. We will find it useful to refer to the discrete states as “dynamic regimes” or “behav- 
iors” in the setting of such models. Conditional on the value of a discrete state, the model does not 
merely produce independent observations, but exhibits autoregressive behavior. 
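
The idea of a collection of AR models indexed by a Markov-switching discrete state can be sketched
as follows. This is a simplified illustration under stated assumptions: each regime has a scalar,
first-order AR coefficient and its own innovation standard deviation, and the transition matrix,
starting state `z0`, and starting value `y0` are supplied by the caller. None of these names come
from the text.

```python
import random

def simulate_ar_hmm(trans, ar_coeff, sigma, n_steps, rng, z0=0, y0=0.0):
    """Simulate a switching AR(1) process driven by a Markov chain over regimes."""
    z, y = z0, y0
    states, observations = [], []
    for _ in range(n_steps):
        # Markov transition: the next regime depends only on the current one.
        z = rng.choices(range(len(trans)), weights=trans[z])[0]
        # Within a regime the observation depends on the previous observation.
        y = ar_coeff[z] * y + rng.gauss(0.0, sigma[z])
        states.append(z)
        observations.append(y)
    return states, observations

trans = [[0.95, 0.05], [0.10, 0.90]]            # "sticky" regimes (assumed values)
states, obs = simulate_ar_hmm(trans, ar_coeff=[0.9, -0.5], sigma=[0.1, 1.0],
                              n_steps=200, rng=random.Random(0))
```

Setting every `ar_coeff` entry to zero recovers an ordinary HMM with Gaussian emissions centered
at zero, in which observations are conditionally independent given the state sequence.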

20.1.2 Latent Dirichlet Allocation

In this section, we briefly overview the latent Dirichlet allocation (LDA) model (Blei et al., 2003) 
as a canonical example of a mixed membership model. We use the language of “documents,”
“topics,” and “words.” In contrast to hard-assignment predecessors that assumed each document 
was associated with a single topic category, LDA aims to model each document as a mixture of 
topics. Throughout this chapter, when describing a mixed membership model, we seek to define 
some observed quantity as an entity that is allowed to be associated with, or have membership 
characterized by, multiple attributes. For LDA, the entity is a document and the attributes are a 
set of possible topics. Typically, in a mixed membership model, each entity represents a set of 
observations, and a key question is what structure is imposed on these observations. For LDA, each 
document is a collection of observed words and the model makes a simplifying exchangeability 
assumption in which the ordering of words is ignored. 

Specifically, LDA associates each document $d$ with a latent distribution over the possible topics,
$\pi^{(d)}$, and each topic $k$ is associated with a distribution over words in the vocabulary, $\theta_k$. Each
word $w_i^{(d)}$ is then generated by first selecting a topic from the document-specific topic distribution
and then selecting a word from the topic-specific word distribution.

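Formally, the standard LDA model with $K$ topics, $D$ documents, and $N_d$ words per document $d$
is given as

\[
\begin{aligned}
\theta_k &\sim \text{Dir}(\eta_1, \ldots, \eta_V) & k &= 1, \ldots, K \\
\pi^{(d)} &\sim \text{Dir}(\beta_1, \ldots, \beta_K) & d &= 1, \ldots, D \\
z_i^{(d)} \mid \pi^{(d)} &\sim \pi^{(d)} & d &= 1, \ldots, D,\ i = 1, \ldots, N_d \\
w_i^{(d)} \mid \{\theta_k\}, z_i^{(d)} &\sim \theta_{z_i^{(d)}} & d &= 1, \ldots, D,\ i = 1, \ldots, N_d.
\end{aligned} \tag{20.3}
\]

Here $z_i^{(d)}$ is a topic indicator variable associated with observed word $w_i^{(d)}$, indicating which topic $k$
generated this $i$th word in document $d$. In expectation, for each document $d$ we have
$E[\pi_k^{(d)} \mid \beta] = \beta_k$. That is, the expected topic proportions for each document are identical a priori.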
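
The generative process of Equation (20.3) can be sketched directly. The helper names below are
illustrative, words and topics are represented by integer indices, and Dirichlet draws are obtained
by normalizing independent gamma variates. Note that the mean of a Dirichlet draw is
$\beta_k / \sum_j \beta_j$, which coincides with $\beta_k$ when the $\beta_k$ sum to one.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_corpus(eta, beta, doc_lengths, rng):
    """Generate topics and documents following the LDA generative process."""
    topics = [sample_dirichlet(eta, rng) for _ in beta]          # theta_k
    corpus = []
    for n_words in doc_lengths:
        pi = sample_dirichlet(beta, rng)                         # topic proportions
        z = [sample_categorical(pi, rng) for _ in range(n_words)]
        w = [sample_categorical(topics[k], rng) for k in z]      # word indices
        corpus.append((pi, z, w))
    return topics, corpus

# Toy setting: vocabulary of 6 words, 3 topics, two short documents.
topics, corpus = generate_corpus([0.5] * 6, [1.0, 1.0, 1.0], [8, 5],
                                 random.Random(1))
```

Since the words of a document are drawn independently given `pi`, shuffling `w` within a document
leaves its probability unchanged, which is the exchangeability assumption mentioned above.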

20.1.3 Bayesian Nonparametric Mixed Membership Models 

The LDA model of Equation (20.3) assumes a finite number of topics $K$. Bayesian nonparametric
methods allow for extensions to models with an unbounded number of topics. That is, in the mixed
membership analogy, each entity can be associated with a potentially countably infinite number of
attributes. We review two such approaches: one based on the hierarchical Dirichlet process (Teh
et al., 2006) and the other based on the beta process (Hjort, 1990; Thibaux and Jordan, 2007). In
the latter case, the association of entities with attributes is directly modeled as sparse.

Hierarchical Dirichlet Process Topic Models 

To allow for a countably infinite collection of topics, in place of finite-dimensional topic
distributions $\pi^{(d)} = [\pi_1^{(d)}, \ldots, \pi_K^{(d)}]$ as specified in Equation (20.3), one wants to define
distributions whose support lies on a countable set, $\pi^{(d)} = [\pi_1^{(d)}, \pi_2^{(d)}, \ldots]$.

FIGURE 20.1

Pictorial representation of the stick-breaking construction of the Dirichlet process.


The Dirichlet process (DP), denoted by $\text{DP}(\alpha H)$, provides a distribution over countably infinite
discrete probability measures

\[
G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \qquad \theta_k \sim H, \tag{20.4}
\]

defined on a parameter space $\Theta$ with base measure $H$. The mixture weights are sampled via a
stick-breaking construction (Sethuraman, 1994):

\[
\pi_k = \nu_k \prod_{\ell=1}^{k-1} (1 - \nu_\ell), \qquad \nu_k \sim \text{Beta}(1, \alpha). \tag{20.5}
\]


This can be viewed as dividing a unit-length stick into lengths given by the weights $\pi_k$: the $k$th
weight is a random proportion $\nu_k$ of the remaining stick after the first $(k - 1)$ weights have been
chosen. We denote this distribution by $\pi \sim \text{GEM}(\alpha)$. See Figure 20.1 for a pictorial representation
of this process.
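
The stick-breaking construction is easy to simulate for a finite number of pieces. The sketch below
returns the first `n_weights` weights together with the length of stick left over. Stopping after a
fixed number of breaks is an assumption of this example, since the construction itself continues
indefinitely.

```python
import random

def stick_breaking(alpha, n_weights, rng):
    """Return the first n_weights stick-breaking weights and the leftover stick."""
    weights, remaining = [], 1.0
    for _ in range(n_weights):
        proportion = rng.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(proportion * remaining)
        remaining *= 1.0 - proportion
    return weights, remaining

weights, leftover = stick_breaking(alpha=2.0, n_weights=20, rng=random.Random(0))
```

Smaller values of `alpha` tend to break off large pieces early, concentrating mass on the first few
weights, whereas larger values spread the mass over many weights.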

Drawing indicators $z_i \sim \pi$, one can integrate out the underlying random stick-breaking measure
$\pi$ to examine the predictive distribution of $z_i$ conditioned on a set of indicators $z_1, \ldots, z_{i-1}$ and
the DP concentration parameter $\alpha$. The resulting sequence of partitions is described via the Chinese
restaurant process (CRP) (Pitman, 2002), which provides insight into the clustering properties
induced by the DP.
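
The CRP predictive rule is simple to state in code: given the current cluster sizes, the next
indicator joins an existing cluster with probability proportional to its size, or starts a new
cluster with probability proportional to $\alpha$. The function names below are illustrative.

```python
import random

def crp_predictive(counts, alpha):
    """Probabilities of joining each existing cluster, then of a new cluster."""
    n = sum(counts)
    return [c / (n + alpha) for c in counts] + [alpha / (n + alpha)]

def sample_crp(n_items, alpha, rng):
    """Sequentially assign n_items indicators under the CRP."""
    counts, assignments = [], []
    for _ in range(n_items):
        probs = crp_predictive(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):            # the last entry means "open a new cluster"
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = sample_crp(n_items=50, alpha=1.0, rng=random.Random(0))
```

For example, `crp_predictive([2, 1], 1.0)` gives probabilities 0.5 and 0.25 for the two existing
clusters and 0.25 for a new one, showing the rich-get-richer behavior of the process.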

For the LDA model, recall that each $\theta_k$ is a draw from a Dirichlet distribution (here denoted
generically by $H$) and defines a distribution over the vocabulary for topic $k$. To define a model
for multiple documents, one might consider independently sampling $G^{(d)} \sim \text{DP}(\alpha H)$ for each
document $d$, where each of these random measures is of the form
$G^{(d)} = \sum_{k=1}^{\infty} \pi_k^{(d)} \delta_{\theta_k^{(d)}}$. Unfortunately, the topic-specific word
distribution for document $d$, $\theta_k^{(d)}$, is necessarily different from that
of document $d'$, $\theta_k^{(d')}$, since each is an independent draw from the base measure $H$. This is clearly
not a desirable model: in a mixed membership model we want the parameter that describes each
attribute (topic) to be shared between entities (documents).

One method of sharing parameters $\theta_k$ between documents while allowing for document-specific
topic weights $\pi^{(d)}$ is to employ the hierarchical Dirichlet process (HDP) (Teh et al., 2006). The
HDP defines a shared set of parameters by drawing $\theta_k$ independently from $H$. The weights are then
specified as

\[
\beta \sim \text{GEM}(\gamma), \qquad \pi^{(d)} \mid \beta \sim \text{DP}(\alpha \beta). \tag{20.6}
\]


Coupling this prior to the likelihood used in the LDA model, we obtain a model that we refer
to as HDP-LDA. See Figure 20.2(a) for a graphical model representation, and Figure 20.3 for an
illustration of the coupling of document-specific topic distributions via the global stick-breaking
distribution $\beta$. Letting $G^{(d)} = \sum_{k=1}^{\infty} \pi_k^{(d)} \delta_{\theta_k}$ and
$G^{(0)} = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}$, one can show that the
specification of Equation (20.6) is equivalent to defining a hierarchy of Dirichlet processes (Teh
et al., 2006):

\[
G^{(0)} \sim \text{DP}(\gamma H), \qquad G^{(d)} \mid G^{(0)} \sim \text{DP}(\alpha G^{(0)}). \tag{20.7}
\]


Thus the name hierarchical Dirichlet process. Note that there are many possible alternative
formulations one could have considered to generate different countably infinite weights $\pi^{(d)}$ with shared
atoms $\theta_k$. The HDP is a particularly simple instantiation of such a model that has appealing
theoretical and computational properties due to its interpretation as a hierarchy of Dirichlet processes.


FIGURE 20.2

Graphical model of the (a) HDP-based and (b) beta-process-based topic model. The HDP-LDA
model specifies a global topic distribution $\beta \sim \text{GEM}(\gamma)$ and draws document-specific topic
distributions as $\pi^{(d)} \mid \beta \sim \text{DP}(\alpha \beta)$. Each word $w_i^{(d)}$ in document $d$ is generated by first
drawing a topic-indicator $z_i^{(d)} \mid \pi^{(d)} \sim \pi^{(d)}$ and then drawing from the topic-specific word
distribution: $w_i^{(d)} \mid \{\theta_k\}, z_i^{(d)} \sim \theta_{z_i^{(d)}}$. The standard LDA model arises as a special case when $\beta$
is fixed to a finite measure $\beta = [\beta_1, \ldots, \beta_K]$. The beta process model specifies a collection
of sparse topic distributions. Here, the beta process measure $B \sim \text{BP}(1, B_0)$ is represented by
its masses $\omega_k$ and locations $\theta_k$, as in Equation (20.8). The features are then conditionally
independent draws $f_{dk} \mid \omega_k \sim \text{Bernoulli}(\omega_k)$, and are used to define document-specific topic
distributions $\pi^{(d)} \mid f_d, \beta \sim \text{Dir}(\beta \otimes f_d)$. Given the topic distributions, the generative process
for the topic-indicators $z_i^{(d)}$ and words $w_i^{(d)}$ is just as in the HDP-LDA model.

Via the construction of Equation (20.6), we have that $E[\pi_k^{(d)} \mid \beta] = \beta_k$. That is, all of the
document-specific topic distributions are centered around the same stick-breaking weights $\beta$.
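
This centering can be checked numerically with a finite truncation. The sketch below is an
approximation under stated assumptions and not the HDP itself: the global weights are truncated
to `K` components (the leftover stick is assigned to the last component), and each document's
weights are then drawn from a finite Dirichlet with parameters $\alpha \beta_k$. The small floor on
the gamma shape parameter is a numerical safeguard added for this example.

```python
import random

def truncated_gem(gamma, K, rng):
    """Stick-breaking weights truncated to K components that sum to one."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        proportion = rng.betavariate(1.0, gamma)
        weights.append(proportion * remaining)
        remaining *= 1.0 - proportion
    weights.append(remaining)           # assumed: leftover mass goes to the last piece
    return weights

def document_weights(alpha, beta, rng):
    """Draw pi ~ Dir(alpha * beta_1, ..., alpha * beta_K) via gamma variates."""
    draws = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
beta = truncated_gem(gamma=3.0, K=5, rng=rng)
docs = [document_weights(10.0, beta, rng) for _ in range(2000)]
means = [sum(d[k] for d in docs) / len(docs) for k in range(len(beta))]
```

The entries of `means` should be close to those of `beta`, while individual documents deviate from
`beta` by an amount controlled by `alpha`: larger values tie the documents more tightly to the
global weights.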


Beta-Bernoulli Process Topic Models 

The HDP-LDA model defines countably infinite topic distributions $\pi^{(d)}$ in which every topic $k$ has
positive mass $\pi_k^{(d)} > 0$ (see Figure 20.3). This implies that each entity (document) is associated
with infinitely many attributes (topics). In practice, however, for any finite length document $d$, only
a finite subset of the topics will be present. The HDP-LDA model implicitly provides such attribute
counts through the assignment of words $w_i^{(d)}$ to topics via the indicator variables $z_i^{(d)}$.

As an alternative representation that more directly captures the inherent sparsity of association 
between documents and topics, one can consider feature-based Bayesian nonparametric variants of 
LDA via the beta-Bernoulli process, such as in the focused topic model of Williamson et al. (2010). 
(A precursor to this model was presented in the time series context by Fox et al. (2010), and is dis- 
cussed in Section 20.2.3.) In such models, each document is endowed with an infinite-dimensional 
binary feature vector that indicates which topics are associated with the given document. In contrast 
to HDP-LDA, this formulation directly allows each document to be represented as a sparse mix- 
ture of topics. That is, there are only a few topics that have positive probability of appearing in any 
document.

FIGURE 20.3

Illustration of the coupling of the document-specific topic distributions $\pi^{(d)}$ via the global
stick-breaking distribution $\beta$. Each topic distribution has countably infinite support and, in expectation,
$E[\pi_k^{(d)} \mid \beta] = \beta_k$.


Informally, one can think of the beta process (BP) (Hjort, 1990; Thibaux and Jordan, 2007)
as defining an infinite set of coin-flipping probabilities and a Bernoulli process realization as
corresponding to the outcome from an infinite coin-flipping sequence based on the
beta-process-determined coin-tossing probabilities. The set of resulting heads indicates the set of selected
features, and implicitly defines an infinite-dimensional feature vector. The properties of the beta process
induce sparsity in the feature space by encouraging sharing of features among the Bernoulli process
realizations.

More formally, let $f_d = [f_{d1}, f_{d2}, \ldots]$ be an infinite-dimensional feature vector associated with
document $d$, where $f_{dk} = 1$ if and only if document $d$ is associated with topic $k$. The beta process,
denoted $\text{BP}(c, B_0)$, provides a distribution on measures

\[
B = \sum_{k=1}^{\infty} \omega_k \delta_{\theta_k}, \tag{20.8}
\]

with $\omega_k \in (0, 1)$. We interpret $\omega_k$ as the feature-inclusion probability for feature $k$ (e.g., the $k$th
topic in an LDA model). This $k$th feature is associated with parameter $\theta_k$.

The collection of points $\{\theta_k, \omega_k\}$ are a draw from a non-homogeneous Poisson process with
rate $\nu(d\omega, d\theta) = c\,\omega^{-1}(1 - \omega)^{c-1}\,d\omega\,B_0(d\theta)$ defined on the product
space $\Theta \otimes [0, 1]$. Here, $c > 0$ and $B_0$ is a base measure with total mass
$B_0(\Theta) = \alpha$. Since the rate measure $\nu$ has infinite mass, the draw from the Poisson process
yields an infinite collection of points, as in Equation (20.8). For an example realization and its
associated cumulative distribution, see Figure 20.4. One can also interpret the beta process as the
limit of a finite model with $K$ features:

$$B_K = \sum_{k=1}^{K} \omega_k \delta_{\theta_k}, \qquad \omega_k \sim \mathrm{Beta}\!\left(\frac{c\alpha}{K},\; c\left(1 - \frac{\alpha}{K}\right)\right), \qquad \theta_k \sim \alpha^{-1} B_0. \tag{20.9}$$

In the limit as $K \to \infty$, $B_K \to B$ and one can define stick-breaking constructions analogous to
those in the Dirichlet process (Paisley et al., 2010; 2011). For each feature $k$, we independently
sample

$$f_{dk} \mid \omega_k \sim \mathrm{Bernoulli}(\omega_k). \tag{20.10}$$


That is, with probability $\omega_k$, topic $k$ is associated with document $d$. One can visualize this
process as walking along the atoms of the discrete beta process measure $B$ and, at each atom
$\theta_k$, flipping a coin with probability of heads given by $\omega_k$. More formally, setting
$X_d = \sum_k f_{dk} \delta_{\theta_k}$, this process is equivalent to sampling $X_d$ from a Bernoulli
process with base measure $B$: $X_d \mid B \sim \mathrm{BeP}(B)$. Example realizations are shown in
Figure 20.4(a).

FIGURE 20.4

(a) Top: A draw $B$ from a beta process is shown by the discrete masses, with the corresponding
cumulative distribution shown above. Bottom: 50 draws $X_i$ from a Bernoulli process using the beta
process realization. Each dot corresponds to a coin flip at that atom in $B$ that came up heads. (b) An
image of a feature matrix associated with a realization from an Indian buffet process with
$\alpha = 10$. Each row corresponds to a different customer, and each column to a different dish. White
indicates a chosen feature.

The characteristics of this beta-Bernoulli process define desirable traits for a Bayesian
nonparametric featural model: we have a countably infinite collection of coin-tossing probabilities
(one for each of our infinite number of features) defined by the beta process, but only a sparse,
finite subset is active in any Bernoulli process realization. In particular, one can show that $B$ has
finite expected mass, implying that there are only a finite number of successes in the infinite
coin-flipping sequence that defines $X_d$. Likewise, the sparse set of features active in $X_d$ is
likely to be similar to that of $X_{d'}$ (an independent draw from $\mathrm{BeP}(B)$), though
variability is clearly possible. Finally, the beta process is conjugate to the Bernoulli process
(Kim, 1999), which implies that one can analytically marginalize the latent random beta process
measure $B$ and examine the predictive distribution of $f_d$ given $f_1, \ldots, f_{d-1}$ and the
concentration parameter $\alpha$. As established by Thibaux and Jordan (2007), the marginal
distribution on the $\{f_d\}$ obtained from the beta-Bernoulli process is the Indian buffet process
(IBP) of Griffiths and Ghahramani (2005), just as the marginalization of the Dirichlet-multinomial
process yields the Chinese restaurant process. The IBP can be useful in developing posterior
inference algorithms, and a significant portion of the literature is written in terms of the IBP
representation.
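
The sparsity and sharing induced by the beta-Bernoulli process can be explored numerically with the
finite model of Equation (20.9). The following is a minimal sketch, not an inference algorithm: it
assumes a truncation to K features, fixes c = 1 by default, and ignores the atom locations. The
function name and all numerical settings are illustrative choices rather than part of the model.

```python
import random

def sample_finite_beta_bernoulli(num_docs, K, alpha, c=1.0, seed=0):
    """Finite-K approximation of Eqs. (20.9)-(20.10); K and seed are illustrative."""
    rng = random.Random(seed)
    # omega_k ~ Beta(c*alpha/K, c*(1 - alpha/K)): feature-inclusion probabilities.
    omega = [rng.betavariate(c * alpha / K, c * (1.0 - alpha / K)) for _ in range(K)]
    # f_dk | omega_k ~ Bernoulli(omega_k), independently for every document d.
    F = [[1 if rng.random() < w else 0 for w in omega] for _ in range(num_docs)]
    return omega, F

omega, F = sample_finite_beta_bernoulli(num_docs=50, K=500, alpha=5.0)
active_per_doc = [sum(row) for row in F]
# Count features that are switched on in more than one document (shared features).
shared = sum(1 for k in range(len(omega)) if sum(row[k] for row in F) > 1)
print("mean active features per document:", sum(active_per_doc) / len(active_per_doc))
print("features shared by at least two documents:", shared)
```

Because each $\omega_k$ has mean $\alpha / K$, the expected number of active features per document is
$\alpha$ regardless of the truncation level, which is the finite analogue of the finite expected mass
of $B$.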

Returning to the LDA model, one can obtain the focused topic model of Williamson et al. (2010)
within the beta-Bernoulli process framework as follows:

$$\begin{aligned}
B &\sim \mathrm{BP}(1, B_0) \\
X_d \mid B &\sim \mathrm{BeP}(B), & d &= 1, \ldots, D \\
\pi^{(d)} \mid f_d, \beta &\sim \mathrm{Dir}(\beta \otimes f_d), & d &= 1, \ldots, D,
\end{aligned} \tag{20.11}$$

where Williamson et al. (2010) treat $\beta$ as random according to $\beta_k \sim \mathrm{Gamma}(\gamma, 1)$.
Here, $f_d$ is the feature vector associated with $X_d$ and $\mathrm{Dir}(\beta \otimes f_d)$ represents a
Dirichlet distribution defined solely over the components indicated by $f_d$, with hyperparameters
the corresponding subset of $\beta$. This implies that $\pi^{(d)}$ is a distribution with positive mass
only on the sparse set of selected topics. See Figure 20.5. Given $\pi^{(d)}$, the $z_i^{(d)}$ and
$w_i^{(d)}$ are generated just as in Equation (20.3). As before, we take
$\theta_k \sim \mathrm{Dir}(\eta_1, \ldots, \eta_V)$. The graphical model is depicted in Figure 20.2(b).

FIGURE 20.5

Illustration of generating the sparse document-specific topic distributions $\pi^{(d)}$ via the beta
process specification. Each document’s binary feature vector $f_d$ limits the support of the topic
distribution to the sparse set of selected topics. The non-zero components are Dirichlet distributed
with hyperparameters given by the corresponding subset of $\beta$. See Equation (20.11).
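
To make the notation $\mathrm{Dir}(\beta \otimes f_d)$ in Equation (20.11) concrete, the sketch below
draws a sparse topic distribution for one document. It assumes a truncation to eight topics and a
hand-picked feature vector; the helper name and the value of the Gamma shape are hypothetical. The
Dirichlet draw uses the standard construction of normalizing independent gamma variables.

```python
import random

def sample_sparse_topic_distribution(f_d, beta, rng):
    """Draw pi^(d) from a Dirichlet restricted to the topics switched on in f_d."""
    # Independent Gamma(beta_k, 1) draws on active topics; inactive topics stay at zero.
    gammas = [rng.gammavariate(b, 1.0) if f == 1 else 0.0 for f, b in zip(f_d, beta)]
    total = sum(gammas)
    if total == 0.0:
        raise ValueError("document has no active topics")
    return [g / total for g in gammas]

rng = random.Random(1)
gamma_shape = 2.0              # illustrative shape for beta_k ~ Gamma(gamma, 1)
K = 8                          # truncation level (an assumption; the model is infinite)
beta = [rng.gammavariate(gamma_shape, 1.0) for _ in range(K)]
f_d = [1, 0, 0, 1, 1, 0, 0, 0]  # hand-picked feature vector for one document
pi_d = sample_sparse_topic_distribution(f_d, beta, rng)
print([round(p, 3) for p in pi_d])
```

Only the three selected topics receive positive probability, in contrast to HDP-LDA, where every
topic has positive mass in every document.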


20.2 Mixed Membership in Time Series 

Building on the background provided in Section 20.1, we can now explore how ideas of mixed 
membership models can be used in the time series setting. Our particular focus is on time series that 
can be well described using regime-switching models. For example, stock returns might be modeled 
as switches between regimes of volatility or an EEG recording between spiking patterns dependent 
on seizure type. For the exercise routines scenario, people switch between a set of actions such as 
jumping jacks, side twists, and so on. In this section, we present a set of regime-switching models 
for describing such datasets, and show how one can interpret the models as providing a form of 
mixed membership for time series. 

To form the mixed membership interpretation, we build on the canonical example of LDA
from Section 20.1.2. Recall that for LDA, the entity of interest is a document and the set of
attributes is the set of possible topics. Each document is then modeled as having membership in
multiple topics (i.e., mixed membership). For time series analysis, the equivalent analogy is that
the entity is the time series $\{y_t : t = 1, \ldots, T\}$, which we denote compactly by $y_{1:T}$. Just as
a document is a collection of observed words, a time series is a sequence of observed data points of
various forms depending upon the application domain. We take the attributes of a time series to be
the collection of dynamic regimes (e.g., jumping jacks, arm circles, etc.). Our mixed membership
time series model associates a single time series with a collection of dynamic regimes. However,
unlike in text analysis, it is unreasonable to assume a bag-of-words representation for time series
since the ordering of the data points is fundamental to the description of each dynamic regime.

The central defining characteristics of a mixed membership time series model are (i) the model 
used to describe each dynamic regime, and (ii) the model used to describe the switches between 
regimes. In Sections 20.2.1 and 20.2.2 we choose one switching model and explore multiple
choices for the dynamic regime model. Another interesting question, explored in Section 20.2.3,
is how to jointly model multiple time series. This question is in direct analogy to the ideas behind 
the analysis of a corpus of documents in LDA. 

20.2.1 Markov Switching Processes as a Mixed Membership Model 

A flexible yet simple regime-switching model for describing a single time series with such pat- 
terned behaviors is the class of Markov switching processes. These processes assume that the time 
series can be described via Markov transitions between a set of latent dynamic regimes which are 
individually modeled via temporally independent or linear dynamical systems. Examples include 
the hidden Markov model (HMM), switching vector autoregressive (VAR) process, and switching 
linear dynamical system (SLDS).¹ These models have proven useful in such diverse fields as speech
recognition, econometrics, neuroscience, remote target tracking, and human motion capture. 

Hidden Markov Models 

The hidden Markov model, or HMM, is a class of doubly stochastic processes based on an underly- 
ing, discrete-valued state sequence that is modeled as Markovian (Rabiner, 1989). Conditioned on 
this state sequence, the model assumes that the observations, which may be discrete or continuous 
valued, are independent. Specifically, let $z_t$ denote the state, or dynamic regime, of the Markov
chain at time $t$ and let $\pi_j$ denote the state-specific transition distribution for state $j$.
Then, the Markovian structure on the state sequence dictates that

$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}. \tag{20.12}$$

Given the state $z_t$, the observation $y_t$ is a conditionally independent emission

$$y_t \mid \{\theta_j\}, z_t \sim F(\theta_{z_t}) \tag{20.13}$$

for an indexed family of distributions $F(\cdot)$. Here, $\theta_j$ are the emission parameters for
state $j$.

A Bayesian specification of the HMM might further assume

$$\pi_j \sim \mathrm{Dir}(\beta_1, \ldots, \beta_K), \qquad \theta_j \sim H \tag{20.14}$$

independently for each HMM state $j = 1, \ldots, K$.

The HMM represents a simple example of a mixed membership model for time series: a given
time series (entity) is modeled as having been generated from a collection of dynamic regimes
(attributes), each with different mixture weights. The key component of the HMM, which differs
from standard mixture models such as in LDA, is the fact that there is a Markovian structure to
the assignment of data points to mixture components (i.e., dynamic regimes). In particular, the
probability that observation $y_t$ is generated from the dynamic regime associated with state $j$
(via an assignment $z_t = j$) is dependent upon the previous state $z_{t-1}$. As such, the mixing
proportions for the time series are defined by the transition matrix $P$ with rows $\pi_j$. This is in
contrast to the LDA model in which the mixing proportions for a given document are simply captured
by a single vector of weights.
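
The generative process of Equations (20.12) and (20.13) can be sketched in a few lines. The example
below is illustrative only: it assumes scalar Gaussian emissions with a common standard deviation, a
uniform initial state, and a hand-picked three-state transition matrix, none of which are prescribed
by the text.

```python
import random

def simulate_hmm(P, means, sd, T, seed=0):
    """Simulate a finite HMM with scalar Gaussian emissions, Eqs. (20.12)-(20.13)."""
    rng = random.Random(seed)
    K = len(P)
    z = [rng.randrange(K)]  # initial state drawn uniformly (an assumption)
    for _ in range(1, T):
        # z_t | z_{t-1} ~ pi_{z_{t-1}}: the previous state selects the row of P.
        z.append(rng.choices(range(K), weights=P[z[-1]])[0])
    # y_t | z_t ~ N(mean_{z_t}, sd^2): observations are independent given the states.
    y = [rng.gauss(means[k], sd) for k in z]
    return z, y

P = [[0.95, 0.04, 0.01],
     [0.02, 0.95, 0.03],
     [0.05, 0.05, 0.90]]
z, y = simulate_hmm(P, means=[-2.0, 0.0, 3.0], sd=0.5, T=200)
occupancy = [z.count(k) / len(z) for k in range(3)]
print("fraction of time spent in each regime:", occupancy)
```

The empirical occupancy plays the role of the time series' membership in each regime, but it is the
full matrix P, not a single weight vector, that governs how that membership unfolds over time.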

Switching VAR Processes 

The modeling assumption of the HMM that observations are conditionally independent given the
latent state sequence is often insufficient in capturing the temporal dependencies present in many
datasets. Instead, one can assume that the observations have conditionally linear dynamics. The
latent HMM state then models switches between a set of such linear models in order to capture more
complex dynamical phenomena. We restrict our attention in this chapter to switching vector
autoregressive (VAR) processes, or autoregressive HMMs (AR-HMMs), which are broadly applicable in
many domains while maintaining a number of simplifying properties that make them a practical
choice computationally.

¹ These processes are sometimes referred to as Markov jump-linear systems (MJLS) within the control theory community.

We define an AR-HMM, with switches between order-$r$ vector autoregressive processes,² as

$$y_t = \sum_{i=1}^{r} A_{i,z_t}\, y_{t-i} + e_t(z_t), \tag{20.15}$$

where $z_t$ represents the HMM latent state at time $t$, and is defined as in Equation (20.12). The
state-specific additive noise term is distributed as $e_t(z_t) \sim \mathcal{N}(0, \Sigma_{z_t})$. We refer
to $A_k = \{A_{1,k}, \ldots, A_{r,k}\}$ as the set of lag matrices. Note that the standard HMM with
zero-mean Gaussian emissions arises as a special case of this model when $A_k = 0$ for all $k$.
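
Equation (20.15) can be simulated directly. The sketch below makes several simplifying assumptions
that are not in the text: the observations are scalar (so each lag matrix is a single coefficient),
the initial history is zero, the initial state is uniform, and the two regimes are hand-picked
stationary AR(2) processes.

```python
import random

def simulate_ar_hmm(P, lags, noise_sd, T, seed=0):
    """Simulate a scalar AR-HMM: y_t = sum_i a_{i,z_t} * y_{t-i} + e_t(z_t)."""
    rng = random.Random(seed)
    K, r = len(P), len(lags[0])
    z = [rng.randrange(K)]   # uniform initial state (an assumption)
    y = [0.0] * r            # zero history for the first r lags (an assumption)
    for t in range(T):
        if t > 0:
            z.append(rng.choices(range(K), weights=P[z[-1]])[0])
        k = z[-1]
        # Regime k supplies both the lag coefficients and the noise scale.
        mean = sum(lags[k][i] * y[-1 - i] for i in range(r))
        y.append(mean + rng.gauss(0.0, noise_sd[k]))
    return z, y[r:]          # drop the artificial initial history

P = [[0.98, 0.02], [0.03, 0.97]]
lags = [[0.9, 0.0],     # regime 0: smooth, slowly decaying dynamics
        [1.2, -0.8]]    # regime 1: oscillatory dynamics (complex roots)
z, y = simulate_ar_hmm(P, lags, noise_sd=[0.1, 0.3], T=300)
print(len(z), len(y))
```

Setting every entry of `lags` to zero removes the dependence on past observations, which recovers
the zero-mean Gaussian HMM special case noted above.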

20.2.2 Hierarchical Dirichlet Process HMMs 

In the HMM formulation described so far, we have assumed that there are $K$ possible different
dynamical regimes. This raises the question: what if $K$ is not known, and what if we would like
to allow for new dynamic regimes to be added as more data are observed? In such scenarios, an
attractive approach is to appeal to Bayesian nonparametrics. Just as the hierarchical Dirichlet
process (HDP) of Section 20.1.3 allowed for a collection of countably infinite topic distributions to
be defined over the same set of topic parameters, one can employ the HDP to define an HMM with a set
of countably infinite transition distributions defined over the same set of HMM emission parameters.

In particular, the HDP-HMM of Teh et al. (2006) defines

$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi_j \mid \beta \sim \mathrm{DP}(\alpha\beta), \qquad \theta_j \sim H. \tag{20.16}$$

The evolution of the latent state $z_t$ and observations $y_t$ are just as in Equations (20.12) and
(20.13). Informally, the Dirichlet process part of the HDP allows for this unbounded state-space and
encourages the use of only a sparse subset of these HMM states. The hierarchical layering of
Dirichlet processes ties together the state-specific transition distributions (via $\beta$), and
through this process, creates a shared sparse state-space.

The induced predictive distribution for the HDP-HMM state $z_t$, marginalizing the transition
distributions $\pi_j$, is known as the infinite HMM urn model (Beal et al., 2002). In particular, the
HDP-HMM of Teh et al. (2006) provides an interpretation of this urn model in terms of an underlying
collection of linked random probability measures. However, the HDP-HMM omits the self-transition
bias of the infinite HMM and instead assumes that each transition distribution $\pi_j$ is identical in
expectation ($E[\pi_{jk} \mid \beta] = \beta_k$), implying that there is no differentiation between
self-transitions and moves between different states. When modeling data with state persistence, as
is common in most real-world datasets, the flexible nature of the HDP-HMM prior places significant
mass on state sequences with unrealistically fast dynamics.

To better capture state persistence, the sticky HDP-HMM of Fox et al. (2008; 2011b) restores
the self-transition parameter of the infinite HMM of Beal et al. (2002) and specifies

$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi_j \mid \beta \sim \mathrm{DP}(\alpha\beta + \kappa\delta_j), \qquad \theta_j \sim H, \tag{20.17}$$

where $(\alpha\beta + \kappa\delta_j)$ indicates that an amount $\kappa > 0$ is added to the $j$th component
of $\alpha\beta$. In expectation,

$$E[\pi_{jk} \mid \beta, \alpha, \kappa] = \frac{\alpha\beta_k + \kappa\,\delta(j, k)}{\alpha + \kappa}. \tag{20.18}$$

Here, $\delta(j, k)$ is the discrete Kronecker delta. From Equation (20.18), we see that the expected
transition distribution has weights which are a convex combination of the global weights defined by
$\beta$ and state-specific weight defined by the sticky parameter $\kappa$. When $\kappa = 0$, the original
HDP-HMM of Teh et al. (2006) is recovered. The graphical model for the sticky HDP-HMM is displayed in
Figure 20.6(a).

² We denote an order-r VAR process by VAR(r).

FIGURE 20.6

Graphical model of (a) the sticky HDP-HMM and (b) an HDP-based AR-HMM. In both cases, the
state evolves as $z_{t+1} \mid \{\pi_k\}, z_t \sim \pi_{z_t}$, where
$\pi_k \mid \beta \sim \mathrm{DP}(\alpha\beta + \kappa\delta_k)$ and $\beta \sim \mathrm{GEM}(\gamma)$. For the
sticky HDP-HMM, the observations are generated as $y_t \mid \{\theta_k\}, z_t \sim F(\theta_{z_t})$ whereas
the HDP-AR-HMM assumes conditionally VAR dynamics as in Equation (20.15), specifically in this case
with order $r = 2$.
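
The effect of the sticky parameter in Equation (20.18) is easy to check numerically. The sketch
below assumes a truncation of the GEM stick-breaking weights to six components, renormalized to sum
to one; this truncation, the helper names, and all parameter values are illustrative assumptions and
not the construction used for inference in the cited work.

```python
import random

def truncated_gem(gamma, K, rng):
    """Stick-breaking draw of beta ~ GEM(gamma), truncated to K weights and renormalized."""
    weights, remaining = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, gamma)   # proportion of the remaining stick to break off
        weights.append(v * remaining)
        remaining *= 1.0 - v
    total = sum(weights)
    return [w / total for w in weights]

def expected_transition_row(beta, j, alpha, kappa):
    """E[pi_jk | beta, alpha, kappa] from Eq. (20.18), for k = 0, ..., K-1."""
    return [(alpha * b + kappa * (1.0 if k == j else 0.0)) / (alpha + kappa)
            for k, b in enumerate(beta)]

rng = random.Random(2)
beta = truncated_gem(gamma=3.0, K=6, rng=rng)
plain = expected_transition_row(beta, j=2, alpha=4.0, kappa=0.0)
sticky = expected_transition_row(beta, j=2, alpha=4.0, kappa=12.0)
print("expected self-transition mass with kappa = 0: ", round(plain[2], 3))
print("expected self-transition mass with kappa = 12:", round(sticky[2], 3))
```

With $\kappa = 0$ the row reduces to $\beta$ itself, so every state shares the same expected
transition distribution; a positive $\kappa$ shifts mass toward the self-transition while the row
still sums to one.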

One can also consider sticky HDP-HMMs with Dirichlet process mixture of Gaussian emissions
(Fox et al., 2011b). Recently, HMMs with Dirichlet process emissions were also considered
in Yau et al. (2011), along with efficient sampling algorithms for computations. Building on the
sticky HDP-HMM framework, one can similarly consider HDP-based variants of the switching
VAR process and switching linear dynamical system, such as represented in Figure 20.6(b); see Fox
et al. (2011a) for further details. For the HDP-AR-HMM, Fox et al. (2011a) consider methods that
allow for switching between VAR processes of unknown and potentially variable order.

20.2.3 A Collection of Time Series 

In the mixed membership time series models considered thus far, we have assumed that we are 
interested in the dynamics of a single (potentially multivariate) time series. However, as in LDA 
where one assumes a corpus of documents, in a growing number of fields the focus is on making 
inferences based on a collection of related time series. One might monitor multiple financial indices, 
or collect EEG data from a given patient at multiple non-contiguous epochs. Recalling the exercise 
routines example, one might have a dataset consisting of multiple time series obtained from multiple 
individuals, each of whom performs some subset of exercise types. In this scenario, we would like 
to take advantage of the overlap between individuals, such that if a “jumping jack” behavior is 
discovered in the time series for one individual then it can be used in modeling the data for other 
individuals. More generally, one would like to discover and model the dynamic regimes that are 
shared among several related time series. The benefits of such joint modeling are twofold: we may 
more robustly estimate representative dynamic models in the presence of limited data, and we may 
also uncover interesting relationships among the time series. 

Recall the basic finite HMM of Section 20.2.1 in which the transition matrix $P$ defined the
dynamic regime mixing proportions for a given time series. To develop a mixed membership model
for a collection of time series, we again build on the LDA example. For LDA, the document-specific
mixing proportions over topics are specified by $\pi^{(d)}$. Analogously, for each time series
$y_{1:T_d}^{(d)}$, we denote the time-series specific transition matrix as $P^{(d)}$ with rows
$\pi_j^{(d)}$. That is, for time series $d$, $\pi_j^{(d)}$ denotes the transition distribution from state
$j$ to each of the $K$ possible next states. Just as LDA couples the document-specific topic
distributions $\pi^{(d)}$ under a common Dirichlet prior, we can couple the rows of the transition
matrix as

$$\pi_j^{(d)} \sim \mathrm{Dir}(\beta_1, \ldots, \beta_K). \tag{20.19}$$

A similar idea holds for extending the HDP-HMM to collections of time series. In particular, we
can specify

$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi_j^{(d)} \mid \beta \sim \mathrm{DP}(\alpha\beta). \tag{20.20}$$

Analogously to LDA, both the finite and infinite HMM specifications above imply that the expected
transition distributions are identical between time series
($E[\pi_j^{(d)} \mid \beta] = E[\pi_j^{(d')} \mid \beta]$). Here, however, the expected transition
distributions are also identical between rows of the transition matrix.
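
The claim that the finite coupling of Equation (20.19) yields identical expected transition
distributions across time series can be checked by simulation. The sketch below assumes hand-picked
Dirichlet parameters and a collection of 200 time series; the helper names are hypothetical, and the
Dirichlet draws again use normalized gamma variables.

```python
import random

def dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(params_k, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def sample_coupled_transition_matrices(beta, D, rng):
    """Eq. (20.19): each row of each P^(d) is an independent Dir(beta_1, ..., beta_K) draw."""
    K = len(beta)
    return [[dirichlet(beta, rng) for _ in range(K)] for _ in range(D)]

rng = random.Random(3)
beta = [4.0, 2.0, 1.0, 1.0]   # illustrative Dirichlet parameters
mats = sample_coupled_transition_matrices(beta, D=200, rng=rng)
K = len(beta)
# Average the first row over all time series and compare with beta / sum(beta).
avg_row0 = [sum(P[0][k] for P in mats) / len(mats) for k in range(K)]
print([round(v, 2) for v in avg_row0])
print([b / sum(beta) for b in beta])
```

Each individual matrix differs, yet the average of any given row over the collection approaches
the same normalized $\beta$, whichever row is chosen.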

To allow for state-specific variability in the expected transition distribution, one could similarly
couple sticky HDP-HMMs, or consider a finite variant of the model via the weak-limit approximation
(see Fox et al. (2011b) for details on finite truncations). Alternatively, one could independently
center each row of the time-series-specific transition matrix around a state-specific distribution.
For the finite model,

$$\pi_j^{(d)} \mid \beta_j \sim \mathrm{Dir}(\beta_{j1}, \ldots, \beta_{jK}). \tag{20.21}$$

For the infinite model, such a specification is more straightforwardly presented in terms of the
Dirichlet random measures. Let $G_j^{(d)} = \sum_k \pi_{jk}^{(d)} \delta_{\theta_k}$, with $\pi_j^{(d)}$ the
time-series-specific transition distribution and $\theta_k$ the set of HMM emission parameters. Over
the collection of $D$ time series, we center $G_j^{(1)}, \ldots, G_j^{(D)}$ around a common
state-$j$-specific transition measure $G_j^{(0)}$. Then, each of the infinite collection of
state-specific transition measures $G_1^{(0)}, G_2^{(0)}, \ldots$ is centered around a global measure
$G_0$. Specifically,

$$G_0 \sim \mathrm{DP}(\gamma H), \qquad G_j^{(0)} \mid G_0 \sim \mathrm{DP}(\eta G_0), \qquad G_j^{(d)} \mid G_j^{(0)} \sim \mathrm{DP}(\alpha G_j^{(0)}). \tag{20.22}$$

Such a hierarchy allows for more variability between the transition distributions than the
specification of Equation (20.20) by only directly coupling state-specific distributions between
time series. The sharing of information between states occurs at a higher level in the latent
hierarchy (i.e., one less directly coupled to observations).

Although they are straightforward extensions of existing models, the models presented in this
section have not been discussed in the literature to the best of our knowledge. Instead, typical
models for coupling multiple time series, each modeled via an HMM, rely on assuming exact sharing of
the same transition matrix. (In the LDA framework, that would be equivalent to a model in which
every document $d$ shared the same topic weights, $\pi^{(d)} = \pi_0$.) With such a formulation, each
time series (entity) has the exact same mixed membership with the global collection of dynamic
regimes (attributes).

Alternatively, models have been proposed in which each time series $d$ is hard-assigned to one of
some $M$ distinct HMMs, where each HMM is comprised of a unique set of states and corresponding
transition distributions and emission parameters. For example, Qi et al. (2007) and Lennox et al.
(2010) examine a Dirichlet process mixture of HMMs, allowing $M$ to be unbounded. Based on a
fixed assignment of time series to some subset of the global collection of HMMs, this model reduces
to $M'$ examples of exact sharing of HMM parameters, where $M'$ is the number of unique HMMs
assigned. That is, there are $M'$ clusters of time series with the exact same mixed membership among
a set of attributes (i.e., dynamic regimes) that are distinct between the clusters.

By defining a global collection of dynamic regimes and time-series-specific transition distri- 
butions, the formulations proposed above instead allow for commonalities between parameteriza- 
tions while maintaining time-series-specific variations in the mixed membership. These ideas more 
closely mirror the LDA mixed membership story for a corpus of documents. 

The Beta-Bernoulli Process HMM 

Analogously to HDP-LDA, the HDP-based models for a collection of (or a single) time series
assume that each time series has membership with an infinite collection of dynamic regimes. This is
due to the fact that each transition distribution $\pi_j^{(d)}$ has positive mass on the countably
infinite collection of dynamic regimes. In practice, just as a finite-length document is comprised
of a finite set of instantiated topics, a finite-length time series is described by a limited set of
dynamic regimes. This limited set might be related to, yet distinct from, the set of dynamic regimes
present in another time series. For example, in the case of the exercise routines, perhaps one
observed individual performs jumping jacks, side twists, and arm circles, whereas another individual
performs jumping jacks, arm circles, squats, and toe touches. In a similar fashion to the
feature-based approach of the focused topic model described in Section 20.1.3, one can employ the
beta-Bernoulli process to directly capture a sparse set of associations between time series and
dynamic regimes.

The beta process framework provides a more abstract and flexible representation of Bayesian 
nonparametric mixed membership in a collection of time series. Globally, the collection of time se- 
ries are still described by a shared library of infinitely many possible dynamic regimes. Individually, 
however, a given time series is modeled as exhibiting some sparse subset of these dynamic regimes. 

More formally, Fox et al. (2010) propose the following specification: each time series $d$ is
endowed with an infinite-dimensional feature vector $f_d = [f_{d1}, f_{d2}, \ldots]$, with $f_{dj} = 1$
indicating the inclusion of dynamic regime $j$ in the membership of time series $d$. The feature
vectors for the collection of $D$ time series are coupled under a common beta process measure
$B \sim \mathrm{BP}(c, B_0)$. In this scenario, one can think of $B$ as defining coin-flipping
probabilities for the global collection of dynamic regimes. Each feature vector $f_d$ is implicitly
modeled by a Bernoulli process draw $X_d \mid B \sim \mathrm{BeP}(B)$ with
$X_d = \sum_k f_{dk} \delta_{\theta_k}$. That is, the beta-process-determined coins are flipped for each
dynamic regime and the set of resulting heads indicates the set of selected features (i.e., via
$f_{dk} = 1$).

The beta process specification allows flexibility in the number of total and time-series-specific 
dynamic regimes, and encourages time series to share similar subsets of the infinite set of possible 
dynamic regimes. Intuitively, the shared sparsity in the feature space arises from the fact that the total 
sum of coin-tossing probabilities is finite and only certain dynamic regimes have large probabilities. 
Thus, certain dynamic regimes are more prevalent among the time series, though the resulting set of 
dynamic regimes clearly need not be identical. For example, the lower subfigure in Figure 20.4(a) 
illustrates a collection of feature vectors drawn from this process. 

To limit each time series to solely switch between its set of selected dynamic regimes, the feature
vectors are used to form feature-constrained transition distributions:

$$\pi_j^{(d)} \mid f_d \sim \mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots] \otimes f_d). \tag{20.23}$$

Again, we use $\mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots] \otimes f_d)$ to denote a
Dirichlet distribution defined over the finite set of dimensions specified by $f_d$ with
hyperparameters given by the corresponding subset of $[\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots]$.
Here, the $\kappa$ hyperparameter places extra expected mass on the component of $\pi_j^{(d)}$
corresponding to a self-transition $\pi_{jj}^{(d)}$, analogously to the sticky hyperparameter of the
sticky HDP-HMM (Fox et al., 2011b). This construction implies that $\pi_j^{(d)}$ has only a finite
number of non-zero entries $\pi_{jk}^{(d)}$. As an example, if

$$f_d = [1\ 0\ 0\ 1\ 1\ 0\ 1\ 0\ 0\ 0\ \cdots],$$

then

$$\pi_j^{(d)} = [\pi_{j1}^{(d)}\ 0\ 0\ \pi_{j4}^{(d)}\ \pi_{j5}^{(d)}\ 0\ \pi_{j7}^{(d)}\ 0\ 0\ 0\ \cdots]$$

with $[\pi_{j1}^{(d)}\ \pi_{j4}^{(d)}\ \pi_{j5}^{(d)}\ \pi_{j7}^{(d)}]$ distributed according to a
four-dimensional Dirichlet distribution. Pictorially, the generative process of the
feature-constrained transition distributions is similar to that illustrated in Figure 20.5.
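
The worked example above can be reproduced with a short sketch of Equation (20.23). It assumes the
feature vector is truncated to ten regimes, uses zero-based indices, and treats $\gamma$ and
$\kappa$ as fixed illustrative values. The helper name is hypothetical, and rejecting a row index
$j$ that is not an active regime is a design choice of this sketch, not something stated in the text.

```python
import random

def feature_constrained_row(f_d, j, gamma, kappa, rng):
    """Draw pi_j^(d) as in Eq. (20.23) for a truncated binary feature vector f_d."""
    if f_d[j] != 1:
        raise ValueError("state j is not an active regime of this time series")
    row = []
    for k, f in enumerate(f_d):
        if f == 1:
            # Active regimes get shape gamma, plus kappa on the self-transition k == j.
            row.append(rng.gammavariate(gamma + (kappa if k == j else 0.0), 1.0))
        else:
            row.append(0.0)  # regimes the series does not possess get zero probability
    total = sum(row)
    return [x / total for x in row]

rng = random.Random(4)
f_d = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
# Zero-based j = 3 corresponds to the fourth regime in the text's one-based indexing.
pi_4 = feature_constrained_row(f_d, j=3, gamma=1.0, kappa=10.0, rng=rng)
print([round(p, 3) for p in pi_4])
```

The resulting row has exactly four non-zero entries, at the positions selected by $f_d$, and on
average most of its mass sits on the self-transition because of the large $\kappa$.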

Although the methodology described thus far applies equally well to HMMs and other Markov
switching processes, Fox et al. (2010) focus on the AR-HMM of Equation (20.15). Specifically,
let $y_t^{(d)}$ represent the observed value of the $d$th time series at time $t$, and let $z_t^{(d)}$
denote the latent dynamical regime. Assuming an order-$r$ AR-HMM, we have

$$\begin{aligned}
z_t^{(d)} \mid z_{t-1}^{(d)} &\sim \pi^{(d)}_{z_{t-1}^{(d)}} \\
y_t^{(d)} &= \sum_{j=1}^{r} A_{j, z_t^{(d)}}\, y_{t-j}^{(d)} + e_t^{(d)}(z_t^{(d)}),
\end{aligned} \tag{20.24}$$

where $e_t^{(d)}(k) \sim \mathcal{N}(0, \Sigma_k)$. Recall that each of the $\theta_k = \{A_k, \Sigma_k\}$
defines a different VAR($r$) dynamic regime and the feature-constrained transition distributions
$\pi^{(d)}$ restrict time series $d$ to transition among dynamic regimes (indexed at time $t$ by
$z_t^{(d)}$) for which it has membership, as indicated by its feature vector $f_d$.

Conditioned on the set of $D$ feature vectors $f_d$ coupled via the beta-Bernoulli process hierarchy,
the model reduces to a collection of $D$ switching VAR processes, each defined on the finite
state-space formed by the set of selected dynamic regimes for that time series. Importantly, the
beta-process-based featural model couples the dynamic regimes exhibited by different time series.
Since the library of possible dynamic parameters is shared by all time series, posterior inference of
each parameter set $\theta_k$ relies on pooling data among the time series that have $f_{dk} = 1$. It is
through this pooling of data that one may achieve more robust parameter estimates than from
considering each time series individually.

The resulting model is termed the BP-AR-HMM, with a graphical model representation presented in
Figure 20.7. The overall model specification is summarized as:³

$$\begin{aligned}
B &\sim \mathrm{BP}(1, B_0) \\
X_d \mid B &\sim \mathrm{BeP}(B), & d &= 1, \ldots, D \\
\pi_j^{(d)} \mid f_d &\sim \mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots] \otimes f_d), & d &= 1, \ldots, D,\ j = 1, 2, \ldots \\
z_t^{(d)} \mid z_{t-1}^{(d)} &\sim \pi^{(d)}_{z_{t-1}^{(d)}}, & d &= 1, \ldots, D,\ t = 1, \ldots, T_d \\
y_t^{(d)} &= \sum_{j=1}^{r} A_{j, z_t^{(d)}}\, y_{t-j}^{(d)} + e_t^{(d)}(z_t^{(d)}), & d &= 1, \ldots, D,\ t = 1, \ldots, T_d.
\end{aligned} \tag{20.25}$$

³ One could consider alternative specifications for $\beta = [\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots]$ such as in the focused topic model of (20.11)
where each element $\beta_k$ is an independent random variable. Note that Fox et al. (2010) treat $\gamma$, $\kappa$ as random.

FIGURE 20.7

Graphical model of the BP-AR-HMM. The beta process distributed measure
$B \mid B_0 \sim \mathrm{BP}(1, B_0)$ is represented by its masses $\omega_k$ and locations $\theta_k$, as in
Equation (20.8). The features are then conditionally independent draws
$f_{dk} \mid \omega_k \sim \mathrm{Bernoulli}(\omega_k)$, and are used to define feature-constrained
transition distributions
$\pi_j^{(d)} \mid f_d \sim \mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots] \otimes f_d)$. The
switching VAR dynamics are as in Equation (20.24).


Fox et al. (2010) apply the BP-AR-HMM to the analysis of multiple motion capture (MoCap) 
recordings of people performing various exercise routines, with the goal of jointly segmenting and 
identifying common dynamic behaviors among the recordings. In particular, the analysis examined 
six recordings taken from the CMU database (CMU, 2009), three from Subject 13 and three from 
Subject 14. Each of these routines used some combination of the following motion categories: run- 
ning in place, jumping jacks, arm circles, side twists, knee raises, squats, punching, up and down, 
two variants of toe touches, arch over, and a reach-out stretch. 

The resulting segmentation from the joint analysis is displayed in Figure 20.8. Each skeleton 
plot depicts the trajectory of a learned contiguous segment of more than two seconds, and boxes 
group segments categorized under the same behavior label in the posterior. The color of the box 
indicates the true behavior label. From this plot we can infer that although some true behaviors 
are split into two or more categories (“knee raises” [green] and “running in place” [yellow]),⁴ the
BP-AR-HMM is able to find common motions (e.g., six examples of “jumping jacks” [magenta])
while still allowing for various motion behaviors that appeared in only one movie (bottom left four
skeleton plots).

The key characteristic of the BP-AR-HMM that enables the clear identification of shared versus 
unique dynamic behaviors is the fact that the model takes a feature-based approach. The true feature 
matrix and BP-AR-HMM estimated matrix, averaged over a large collection of MCMC samples, are 
shown in Figure 20.9. Recall that each row represents an individual recording’s feature vector $f_d$
drawn from a Bernoulli process, and coupled under a common beta process prior. The columns
indicate the possible dynamic behaviors (truncated to a finite number if no assignments were made
thereafter).


⁴ The split behaviors shown in green and yellow correspond to the true motion categories of knee raises and running,
respectively, and the splits can be attributed to the two subjects performing the same motion in a distinct manner.

FIGURE 20.8

Each skeleton plot displays the trajectory of a learned contiguous segment of more than two seconds,
bridging segments separated by fewer than 300 msec. The boxes group segments categorized under
the same behavior label, with the color indicating the true behavior label (allowing for analysis
of split behaviors). Skeleton rendering done by modifications to Neil Lawrence’s Matlab MoCap
toolbox (Lawrence, 2009).



FIGURE 20.9 

Feature matrices associated with the true MoCap sequences (left) and BP-AR-HMM estimated se- 
quences over iterations 15,000 to 20,000 of an MCMC sampler (right). Each row is an individual 
recording and each column a possible dynamic behavior. The white squares indicate the set of
selected dynamic behaviors.

20.3 Related Bayesian and Bayesian Nonparametric Time Series Models

In addition to the regime-switching models described in this chapter, there is a large and growing
literature on Bayesian parametric and nonparametric time series models, many of which also have
interpretations as mixed membership models. We overview some of this literature in this section,
aiming not to cover the entirety of related literature but simply to highlight three main themes: (i)
non-homogeneous mixed membership models, and relatedly, time-dependent processes, (ii) other
HMM-based models, and (iii) time-independent mixtures of autoregressions.

20.3.1 Non-Homogeneous Mixed Membership Models 
Time-Varying Topic Models

The documents in a given corpus sometimes represent a collection spanning a wide range of time. It 
is likely that the prevalence and popularity of various topics, and words within a topic, change over 
this time period. For example, when analyzing scientific articles, the set of scientific questions being 
addressed naturally evolves. Likewise, within a given subfield, the terminology similarly develops — 
perhaps new words are created to describe newly discovered phenomena or other words go out of 
vogue. 

To capture such changes, Blei and Lafferty (2006) proposed a dynamic topic model. This model 
takes the general framework of LDA, but specifies a Gaussian random walk on a set of topic-specific 
word parameters

$$\theta_{t,k} \mid \theta_{t-1,k} \sim \mathcal{N}(\theta_{t-1,k}, \sigma^2 I) \tag{20.26}$$

and document-specific topic parameters

$$\beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \delta^2 I). \tag{20.27}$$

The topic-specific word distribution arises via
$\pi(\theta_{t,k,w}) = \exp(\theta_{t,k,w}) / \sum_{w'} \exp(\theta_{t,k,w'})$. For the topic distribution,
Blei and Lafferty (2006) specify $\eta \sim \mathcal{N}(\beta_t, a^2 I)$ and transform to $\pi(\eta)$. This formulation
provides a non-homogeneous mixed membership model since the membership weights (i.e., topic
weights) vary with time.
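
As a rough illustration of the random walk in (20.26) and the subsequent map to the simplex, the following sketch simulates the word parameters of a single topic and converts them to word distributions. It is only a toy simulation under stated assumptions: the function names, the standard normal initial state, the vocabulary size, and the value of sigma are all hypothetical choices, not taken from Blei and Lafferty (2006).

```python
import math
import random


def softmax(values):
    """Map real-valued parameters to a probability vector (the map pi in the text)."""
    peak = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_topic_drift(n_steps, vocab_size, sigma, rng):
    """Simulate theta_t | theta_{t-1} ~ N(theta_{t-1}, sigma^2 I) for one topic and
    return the word distribution pi(theta_t) at every time step."""
    # Initial state drawn from N(0, I): an assumption made for illustration only.
    theta = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    path = [softmax(theta)]
    for _ in range(n_steps - 1):
        theta = [v + rng.gauss(0.0, sigma) for v in theta]  # Gaussian random-walk step
        path.append(softmax(theta))
    return path


rng = random.Random(0)
word_dists = simulate_topic_drift(n_steps=5, vocab_size=4, sigma=0.1, rng=rng)
for t, dist in enumerate(word_dists):
    print(t, [round(p, 3) for p in dist])
```

With a small sigma the word distributions change slowly from one time step to the next, which is the qualitative behavior the dynamic topic model is designed to capture.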

The formulation of Blei and Lafferty (2006) assumes discrete, evenly spaced corpora of doc- 
uments. Often, however, documents are observed at uneven and potentially finely-sampled time 
points. Wang et al. (2008) explore a continuous time extension by modeling the evolution of $\theta_{t,k}$ as
Brownian motion. As a simplifying assumption, the authors do not consider evolution of the global
topic proportions $\beta_t$.

Time-Dependent Bayesian Nonparametric Processes 

For Bayesian nonparametric time-varying topic modeling, Srebro and Roweis (2005) propose a
time-dependent Dirichlet process. The Dirichlet process allows for an infinite set of possible topics,
in a similar vein to the motivation in HDP-LDA. Importantly, however, this model does not assume 
a mixed membership formulation and instead takes each document to be hard-assigned to a single 
topic. The proposed time-dependent Dirichlet process models the changing popularity of various 
topics, but assumes that the topic-specific word distributions are static. That is, the Dirichlet process 
probability measures have time-varying weights, but static atoms. 

More generally, there is a growing interest in time-dependent Bayesian nonparametric processes.
The dependent Dirichlet process was originally proposed by MacEachern (1998). A substantial fo- 
cus has been on evolving the weights of the random discrete probability measures. Recently, Griffin 
and Steel (2011) examine a general class of autoregressive stick-breaking processes, and Mena et al.
(2011) study stick-breaking processes for continuous-time modeling. Taddy (2010) considers an alternative
autoregressive specification for Dirichlet process stick-breaking weights, with application
to modeling the changing rate function in a dynamic spatial Poisson process.

20.3.2 Hidden-Markov-Based Bayesian Nonparametric Models 

A number of other Bayesian nonparametric models have been proposed in the literature that 
take as their point of departure a latent Markov switching mechanism. Both the infinite factorial 
HMM (Van Gael et al., 2008) and the infinite hierarchical HMM (Heller et al., 2009) provide 
Bayesian nonparametric priors for infinite collections of latent Markov chains. The infinite fac- 
torial HMM provides a distribution on binary Markov chains via a Markov Indian buffet process. 
The implicitly defined time-varying infinite-dimensional binary feature vectors are employed in
performing blind source separation (e.g., separating an audio recording into a time-varying set of
overlapping speakers). The infinite hierarchical HMM also employs an infinite collection of Markov
chains, but the evolution of each depends upon the chain above. Instead of modeling binary Markov 
chains, the infinite hierarchical HMM examines finite multi-class state-spaces. 

Another method that is based on a finite state-space is that of Taddy and Kottas (2009). The 
proposed model assumes that each HMM state defines an independent Dirichlet process regression. 
Extensions to non-homogeneous Markov processes are considered based on external covariates that
inform the latent state. 

In Saeedi and Bouchard-Côté (2012), the authors propose a hierarchical gamma-exponential
process for modeling recurrent continuous time processes. This framework provides a continuous- 
time analog to the discrete-time sticky HDP-HMM. 

Instead of Markov-based regime-switching models that capture repeated returns to some (possi- 
bly infinite) set of dynamic regimes, one can consider changepoint methods in which each transition 
is to a new dynamic regime. Such methods often allow for very efficient computations. For example,
Xuan and Murphy (2007) base such a model on the product partition model⁵ framework to
explore changepoints in the dependency structure of multivariate time series, harnessing the efficient
dynamic programming techniques of Fearnhead (2006). More recently, Zantedeschi et al.
(2011) explore a class of dynamic product partition models and online computations for predicting
movements in the term structure of interest rates.

20.3.3 Bayesian Mixtures of Autoregressions 

In this chapter, we explored two forms of switching autoregressive models: the HDP-AR-HMM and 
the BP-AR-HMM. Both models assume that the switches between autoregressive parameters fol- 
low a discrete-time Markov process. There is also substantial literature on nonlinear autoregressive 
modeling via mixtures of autoregressive processes, where the mixture components are independently
selected over time. Lau and So (2008) consider a Dirichlet process mixture of autoregressions.
That is, at each time step the observation is modeled as having been generated from one of an
unbounded collection of autoregressive processes, with the mixing distribution given by a Dirich- 
let process. A variational approach to Dirichlet process mixtures of autoregressions with unknown 
orders has recently been explored in Morton et al. (2011). Wood et al. (2011) aim to capture the 
idea of structural breaks by segmenting a time series into contiguous blocks of L observations and 
assigning each segment to one of a finite mixture of autoregressive processes; implicitly, all L ob- 
servations are associated with a given mixture component. Key to the formulation is the inclusion 
of time-varying mixture weights, leading to a nonstationary process, as in Section 20.3.1. 

5 A product partition model is a model in which the data are assumed independent across some set of unknown
partitions (Hartigan, 1990; Barry and Hartigan, 1992). The Dirichlet process is a special case of a product partition model.

As an alternative formulation that captures Markovianity, but not directly in the latent mixture
component, Müller et al. (1997) consider a model in which the probability of choosing a given
autoregressive component is modeled via a kernel based on the previous set of observations (and 
potential covariates). The maximal set of K mixture components is fixed, with the associated au- 
toregressive parameters taken to be draws from a Dirichlet process, implying that only k < K will 
take distinct values. 


20.4 Discussion 

In this chapter, we have discussed a variety of time series models that have interpretations in the 
mixed membership framework. Mixed membership models are comprised of three key components: 
entities, attributes, and data. What differs between mixed membership models is the type of data 
associated with each entity, and how the entities are assigned membership with the set of possible 
attributes. Abstractly, in our case each time series is an entity that has membership with a collection 
of dynamic regimes, or attributes. The partial memberships are determined based on the temporally 
structured observations, or data, for the given time series. This structured data is in contrast to the 
typical focus of mixed membership models on exchangeable collections of data per entity (e.g., a 
bag-of-words representation of a document’s text).

Throughout the chapter, we have focused our attention on the class of Markov switching pro- 
cesses, and further restricted our exposition to Bayesian parametric and nonparametric treatments 
of such models. The latter allows for an unbounded set of attributes by modeling processes with 
Markov transitions between an infinite set of dynamic regimes. For the class of Markov switching 
processes, the mixed membership of a given time series is captured by the time-series-specific set 
of Markov transition distributions. Examples include the classical hidden Markov model (HMM), 
autoregressive HMM, and switching state-space model. In mixed membership modeling, one typ- 
ically has a group of entities (e.g., a corpus of documents) and the goal is to allow each entity 
to have a unique set of partial memberships among a shared collection of attributes (e.g., topics). 
Through such modeling techniques, one can efficiently and flexibly share information between the 
data sources associated with the entities. Motivated by such goals, in this chapter we explored a 
nontraditional treatment of time series analysis by examining models for collections of time series. 
We proposed a Bayesian nonparametric model for multiple time series based on ideas analogous to 
Dirichlet-multinomial modeling of documents. We also reviewed a Bayesian nonparametric model 
based on a beta-Bernoulli framework that directly allows for sparse association of time series with 
dynamic regimes. Such a model enables decoupling the presence of a dynamic regime from its 
prevalence. 

Time series analysis has previously received little attention from a mixed membership perspective;
the discussion herein suggests interesting ideas for further development of time series models.

A mixed membership model is an individual-level mixture model where individuals have partial
membership of the profiles that characterize a population. A mixed membership model for rank 
data is outlined and illustrated through the analysis of voting in the 2002 Irish general election. 
This particular election uses a voting system called Proportional Representation using a Single 
Transferable Vote (PR-STV), where voters rank some or all of the candidates in order of preference. 
The dataset considered consists of all votes in a constituency from the 2002 Irish general election. 
Interest lies in highlighting distinct voting profiles within the electorate and studying how voters 
affiliate themselves to these voting profiles. The mixed membership model for rank data is fitted to 
the voting data and is shown to give a concise and highly interpretable explanation of voting patterns 
in this election. 

21.1 Introduction

Mixture models are a well-established tool for model-based clustering of data (McLachlan and Bas- 
ford, 1988; Fraley and Raftery, 2002). Mixture models describe a population as a finite collection of 
homogeneous groups, each of which is characterized by a specific probability density. While based 
on a similar concept, mixed membership, or Grade of Membership (GoM), models allow every in- 
dividual to have partial membership in each of the profiles that characterize the population. Thus, 
mixed membership models provide a method for model-based soft clustering of data. The mixed 
membership (or GoM) model for multivariate categorical data is developed in Erosheva (2002) and 
Blei et al. (2003), and this model has been used in a number of applications including Erosheva 
et al. (2004; 2007) and Airoldi et al. (2010), amongst others. 

Rank data arise when a set of judges rank some (or all) of a set of objects. Rank data emerge in 
many areas of society; the final ordering of athletes in a race, league tables, the ranking of relevant 
results by internet search engines, and consumer preference data provide examples of such data. In 
this chapter, a mixed membership model for rank data that was originally developed in Gormley and 
Murphy (2009) is described and applied to the problem of finding structure in Irish voting data. 

The Irish electoral system uses a voting system called Proportional Representation using a Sin- 
gle Transferable Vote (PR-STV). In this system, voters rank some or all of the candidates in order 
of preference. When drawing inferences from such data, the information contained in the different 
preference levels must be exploited by the use of appropriate modeling tools. An illustration of the 
mixed membership model for rank data methodology is provided through an examination of vot- 
ing data from the 2002 Irish general election. Interest lies in highlighting voting profiles that occur 
within the electorate. The mixed membership model provides the scope to examine if and how vot- 
ers exhibit mixed membership by sharing preference behavior described by more than one of these 
voting profiles. 

A latent class representation of the mixed membership model for rank data is used for model 
fitting within the Bayesian paradigm. A Metropolis-within-Gibbs sampler is necessary to provide 
samples from the posterior distribution. Model selection is achieved using the deviance information 
criterion (DIC) and the adequacy of model fit is assessed using posterior predictive checks. 

The chapter proceeds as follows: Section 21.2 outlines the Irish voting system and the details 
surrounding the 2002 Irish general election. We employ the Plackett-Luce model for rank data in 
this application as the rank data model; we discuss this model in Section 21.3.1. The specification 
of the mixed membership model for rank data follows in Section 21.3.2. Estimation of the mixed 
membership model for rank data is outlined in Section 21.4.1. Section 21.4.2 addresses the question 
of model choice. We present the application of the mixed membership model for rank data to the
2002 Irish general election data in Section 21.5. The chapter concludes in Section 21.6 with a 
discussion of the methodology. 


21.2 The 2002 Irish General Election 

Dail Eireann is the main parliament in the Republic of Ireland; it has 166 members. Members (called 
Teachtai Dala or TDs) are elected to the Dail through a general election which must take place at
least every five years. On May 17, 2002, a general election was held to elect the 29th Dail; candi- 
dates ran in 42 constituencies. Each constituency elected either three, four, or five candidates, where 
the number of candidates to be elected is determined by the population of the constituency. The 
Ceann Comhairle is the position of Speaker of the House in Dail Eireann. The Ceann Comhairle 
from the previous parliament is automatically re-elected in their constituency and thus the number
of candidates elected through the general election in that constituency is reduced by one. The outgoing
government consisted of a Fianna Fail and Progressive Democrat coalition with Fianna Fail
having 77 seats and the Progressive Democrats having 4 seats. Thus, the outgoing government was
a minority government that relied on a number of independent TDs for support. After the election,
a coalition government involving Fianna Fail and the Progressive Democrats was formed again, this 
time with a majority holding 81 and 8 seats, respectively. This was the first time that a govern- 
ment had been re-elected in an Irish general election in 30 years. Extensive descriptions of the 2002 
election are provided by Kennedy (2002), Weeks (2002), Gallagher et al. (2003), and Marsh (2003). 

In the 2002 general election, a trial was conducted in three constituencies where electronic 
voting was introduced: Dublin North, Dublin West, and Meath. The voting data from these three 
constituencies was made publicly available providing an unprecedented insight into the voting in 
Irish elections beyond what had previously been available in poll data. The data from the Dublin 
North constituency was analyzed because it contained a particularly diverse range of candidates and 
thus the data was expected to contain interesting voting behavior. 

In 2002, the Dublin North constituency consisted of an electorate of 72,353 with four TDs 
to be elected from this constituency. A total of 43,942 people voted and twelve candidates ran 
for election: Fianna Fail, the largest political party at the time, ran three candidates; Fine Gael, 
the largest opposition party, ran two candidates; the Labour, Green, and Sinn Fein parties ran one 
candidate each, and smaller parties like the Socialist, Christian Solidarity, and Independent Health 
Alliance parties also ran one candidate each. One independent candidate ran for election and the 
Progressive Democrats did not run any candidate in Dublin North. Four of the candidates were 
incumbent candidates from the 28th Dail; however, Sean Ryan (Labour) was elected to the 28th 
Dail through a by-election after the resignation of Ray Burke (Fianna Fail) from his seat during the 
28th Dail. 

The votes in the election were totaled through a series of counts where candidates are elimi- 
nated, their votes are distributed, and surplus votes are transferred between candidates. A detailed 
introduction to the PR-STV voting system in an Irish context is given in Sinnott (1999) and a good 
overall comparison of different voting systems is given by Farrell (2001) and Gallagher and Mitchell 
(2005). 

Details of the counting and transfer of votes in the Dublin North constituency are shown in Ta- 
ble 21.1. The total valid poll consisted of 43,942 votes, so the number of votes required to guarantee 
election (called the droop quota) was 8,789. In the first count, the number of first preferences for 
each candidate was counted. If no candidate exceeded the droop quota, then the lowest candidates 
were eliminated and their votes were distributed using the next available preferences on their bal- 
lots; that is, a vote was transferred to the next preferred candidate on the ballot who had not been 
eliminated or already elected; if no such candidate existed then the vote was considered to be non- 
transferable. If a candidate was elected by exceeding the droop quota, then their surplus votes (the 
amount by which they exceed the droop quota) were distributed using the next available prefer- 
ences on these surplus votes; the surplus votes to be transferred were sampled from the set of votes 
that brought the candidate over the droop quota. The procedure of eliminating low candidates and 
distributing surpluses continued until either four candidates exceeded the droop quota or only four 
candidates remained. 

For example, Trevor Sargent was the first candidate to be elected; he was elected in round 6 of 
the count because he exceeded the droop quota on the basis of the 7,294 first preference votes and 
2,491 votes that he received through transfers in rounds 1 to 5 of the count. Because he received 
996 votes in excess of the droop quota, these excess votes were transferred in round 7; the 996 votes
that were distributed were sampled from the 1,667 that he received in round 5, because these votes 
brought his total over the droop quota. 
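
The quota and surplus arithmetic quoted above can be checked with a few lines of code. The sketch below uses the standard Droop quota formula, floor(votes / (seats + 1)) + 1, together with the first-preference total and transfers listed for Trevor Sargent in Table 21.1; the function and variable names are illustrative only.

```python
def droop_quota(valid_votes, seats):
    """Smallest vote total that no more than `seats` candidates can all reach."""
    return valid_votes // (seats + 1) + 1


quota = droop_quota(valid_votes=43942, seats=4)

# Trevor Sargent: first preferences plus the transfers received before round 6,
# as listed in Table 21.1.
first_preferences = 7294
transfers_received = [86, 298, 140, 300, 1667]
total_at_round_6 = first_preferences + sum(transfers_received)
surplus = total_at_round_6 - quota

print(quota, total_at_round_6, surplus)
```

The computed surplus agrees with the entry of -996 recorded against Sargent in Table 21.1.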

By the end of the vote count in Dublin North, two candidates reached the droop quota and two
were elected without reaching the quota. The four candidates elected were also the four candidates
with the highest number of first preferences, but this does not necessarily happen.

TABLE 21.1

The counting and transfer of votes in the Dublin North constituency in the 2002 Irish general election.
The incumbent candidates are marked with an asterisk. The point at which each candidate was
elected is marked in bold. For each candidate, the first row gives the vote total at each round of the
count and the second row gives the votes transferred after that round.

Candidate (Abbreviation)  Party                         Round of Count
                                                              1      2      3      4      5      6      7      8
Trevor Sargent* (Sa)      Green                            7294   7380   7678   7818   8118   9785   8789   8789
                                                            +86   +298   +140   +300  +1667   -996
Sean Ryan* (Ry)           Labour                           6359   6407   6535   6665   6847   8578   9128   9128
                                                            +48   +128   +130   +182  +1731   +550
Jim Glennon (Gl)          Fianna Fail                      5892   5945   6028   6152   6294   6511   6596   8640
                                                            +53    +83   +124   +142   +217    +85  +2044
G V Wright* (Wr)          Fianna Fail                      5658   5707   5739   5777   5868   6139   6249   8617
                                                            +49    +32    +38    +91   +271   +110  +2368
Clare Daly (Dy)           Socialist                        5501   5551   5730   5796   6244   6590   6772   7523
                                                            +50   +179    +66   +448   +346   +182   +751
Michael Kennedy (Ke)      Fianna Fail                      5253   5309   5368   5422   5532   5732   5801
                                                            +56    +59    +54   +110   +200    +69  -5801
Nora Owen* (Ow)           Fine Gael                        4012   4030   4132   4720   4763
                                                            +18   +102   +588    +43  -4763
Mick Davis (Dv)           Sinn Fein                        1350   1382   1424   1440
                                                            +32    +42    +16  -1440
Cathal Boland (Bo)        Fine Gael                        1177   1189   1216
                                                            +12    +27  -1216
Ciaran Goulding (Go)      Independents Health Alliance      914   1009
                                                            +95  -1009
Eamon Quinn (Qu)          Independent                       285
                                                           -285
David Walshe (Wa)         Christian Solidarity Party        247
                                                           -247
Non Transferable                                                     33     92    152    276    607    607   1245
                                                            +33    +59    +60   +124   +331          +638
Total                                                    43,942


21.3 Model Specification 

The Dublin North general election voting data possess some unique properties which require careful 
statistical modeling. A mixed membership model can easily accommodate the differing preferences 
that voters may have for the candidates. Although a finite mixture model may be used for the same 
purpose (e.g., Gormley and Murphy, 2008a) the finite mixture model needs a large number of mix- 
ture components to account for the voting behavior exhibited in the electorate; conversely the mixed 
membership model can account for different behavior using a relatively small number of profiles. 
In order to account for the ranked nature of the preference voting data, the Plackett-Luce model for 
rank data is used. 

Under the PR-STV electoral system, a voter ranks some or all of the candidates in order of preference.
In order to appropriately model such data, a model for rank data is required. A large number
of models for rank data have already been developed (Bradley and Terry, 1952; Mallows, 1957; 
Plackett, 1975), and these are reviewed in Marden (1995). In this study the Plackett-Luce model 
(Plackett, 1975) is utilized to model the rank nature of the data. 

The Plackett-Luce model is parameterized by a ‘support’ parameter

$$p = (p_1, p_2, \ldots, p_N),$$

where $N$ denotes the total number of electoral candidates. Note that $0 \le p_j \le 1$ and $\sum_{j=1}^{N} p_j = 1$.
The parameter $p_j$ has the interpretation of being the probability of candidate $j$ being ranked first
by a voter. The model assumes that the probability of candidate j being given a lower than first 
preference is proportional to their support parameter pj, but conditional on a smaller number of 
candidates being available for selection at lower preferences. Hence, at preference levels lower than 
the first, the probabilities are re-normalized to provide valid probability values. Further, it can be 
shown that the Plackett-Luce model has a random utility choice model interpretation (Chapman and 
Staelin, 1982). 

Let voter $i$ record the vote $x_i = \{c(i,1), c(i,2), \ldots, c(i,n_i)\}$, where $n_i$ is the number of
preferences expressed by voter $i$. The Plackett-Luce model states that the probability of vote $x_i$ is given
as

$$P\{x_i \mid p\} = \prod_{t=1}^{n_i} \frac{p_{c(i,t)}}{p_{c(i,t)} + p_{c(i,t+1)} + \cdots + p_{c(i,N)}} = \prod_{t=1}^{n_i} \frac{p_{c(i,t)}}{\sum_{s=t}^{N} p_{c(i,s)}}, \tag{21.1}$$

where $c(i,n_i+1), \ldots, c(i,N)$ is any permutation of the unranked candidates. Note that the probability
of the ranking is conditional on $n_i$, the number of preferences expressed, and it can easily be
shown that (21.1) sums to 1 over all $n_i!$ possible permutations of the candidates ranked in the vote
$x_i$.
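
To make (21.1) concrete, the following sketch evaluates the Plackett-Luce probability of a possibly partial ranking. The function name and the three-candidate support vector are hypothetical and chosen purely for illustration. The denominator at preference level t is computed as the total support minus the support of the candidates already ranked, which equals the sum over the remaining candidates in (21.1).

```python
from itertools import permutations


def plackett_luce_probability(ranking, support):
    """Probability (21.1) of a ranking under the Plackett-Luce model.

    ranking: candidate indices in order of preference (length n_i <= N).
    support: support parameters p_1, ..., p_N (non-negative, summing to 1).
    """
    remaining = sum(support)  # support of the candidates still available
    prob = 1.0
    for candidate in ranking:
        prob *= support[candidate] / remaining  # re-normalized choice probability
        remaining -= support[candidate]         # candidate is no longer available
    return prob


support = [0.5, 0.3, 0.2]  # illustrative values only
prob = plackett_luce_probability([1, 0], support)

# Sanity check: the probabilities of all ordered choices of two candidates
# out of three add up to one.
total = sum(plackett_luce_probability(list(r), support)
            for r in permutations(range(len(support)), 2))
print(round(prob, 4), round(total, 4))
```

Here the ballot ranks candidate 1 first and candidate 0 second, so the probability is 0.3 multiplied by 0.5 / (0.5 + 0.2).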


21.3.2 The Mixed Membership Model for Rank Data 

Mixed membership models allow every individual in a population to have partial membership in 
each of the profiles that characterize the population; thus, a soft clustering of the population mem- 
bers is achievable. Herein we describe a mixed membership model for rank data as developed by 
Gormley and Murphy (2009). 

Under the mixed membership model, each voter $i = 1, \ldots, M$ has an associated mixed membership
parameter $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{iK})$ which is a direct parameter of the model. The mixed
membership parameter $\pi_i$ describes the degree of membership of individual $i$ in each of the $K$ profiles
which characterize the electorate. Note that $0 \le \pi_{ik} \le 1$ and $\sum_{k=1}^{K} \pi_{ik} = 1$ for $i = 1, \ldots, M$.
Thus, if individual $i$ is fully characterized by profile $k$, then $\pi_{ik} = 1$ and $\pi_{ij} = 0$ for $j \ne k$.
Additionally, if individual $i$ is characterized by profiles $\mathcal{K} \subset \{1, 2, \ldots, K\}$, then $\pi_{ij} > 0$ for
$j \in \mathcal{K}$ and $\pi_{ij} = 0$ for $j \notin \mathcal{K}$.

The mixed membership model for ranked data is formulated as follows: We assume that the 
probability of voter i ranking candidate j in position t on their ballot is a convex combination of the 
probability of the voter choosing candidate j in position t as described by each profile, where the 
weights in the convex combination are equal to the voter’s mixed membership parameter. That is, 
the probability of voter i choosing candidate j at preference level t, conditional on voter i’s mixed 
membership parameter $\pi_i$ and the profile-specific support parameters $p = (p_1, p_2, \ldots, p_K)$, is given
as

$$P\{c(i,t) = j \mid \pi_i, p\} = \sum_{k=1}^{K} \pi_{ik} \frac{p_{kj}}{\sum_{s=t}^{N} p_{kc(i,s)}}. \tag{21.2}$$


Additionally, local independence is then assumed between each preference level $t$, given the mixed
membership parameters. Thus, the conditional probability of ranking $x_i$ given membership parameter
$\pi_i$ and support parameters $p$ is

$$P\{x_i \mid \pi_i, p\} = \prod_{t=1}^{n_i} \left\{ \sum_{k=1}^{K} \pi_{ik} \frac{p_{kc(i,t)}}{\sum_{s=t}^{N} p_{kc(i,s)}} \right\}$$

and the likelihood function based on the data $x = (x_1, x_2, \ldots, x_M)$ is therefore

$$P\{x \mid \pi, p\} = \prod_{i=1}^{M} \prod_{t=1}^{n_i} \left\{ \sum_{k=1}^{K} \pi_{ik} \frac{p_{kc(i,t)}}{\sum_{s=t}^{N} p_{kc(i,s)}} \right\}.$$

Note that under the mixed membership model, each voter has partial membership of each profile
and mixing takes place at each preference level $t$ rather than at the vote level as would be typical
of a rank data mixture model (Stern, 1993; Murphy and Martin, 2003; Gormley and Murphy, 2006;
Busse et al., 2007; Gormley and Murphy, 2008a;b). Modeling rank data in this manner provides a
deeper insight into the structure within the electorate by allowing mixing to occur at a finer level.
This is a desirable characteristic as it may be restrictive to assume a voter expresses all preferences
in their vote as dictated by a single profile; it is likely that a voter may express some preferences
in line with the support parameters of one profile, and other preferences in line with the support
parameters of other profiles. This is clearer when we look at the latent class representation of the
mixed membership model (Section 21.3.2).

A Latent Class Representation of the Mixed Membership Model 

The mixed membership model for rank data can be expressed using a latent class representation in 
a manner similar to Erosheva (2006); this representation facilitates efficient inference for the model 
and it assists with model interpretation. The latent class representation of the mixed membership 
model for rank data involves augmenting the data for each voter i with categorical latent variables 
which record the profile that is used by voter $i$ when recording preference level $t$. The discrete
distribution for the latent classes has a functional form that depends on mixed membership parameters
$\pi_i$ for voter $i$.

For each voter $i$, we impute binary latent vectors $z_{it} = (z_{it1}, \ldots, z_{itK})$ for $t = 1, \ldots, n_i$, where
$z_{it} \sim \text{Multinomial}(1, \pi_i)$. The value of $z_{it}$ records the voting profile that is used by voter $i$ when
recording preference level $t$.

It follows that under the mixed membership model the ‘augmented’ data likelihood function
based on the data $x$ and the binary latent variables $z$ is therefore of the form

$$P\{x, z \mid \pi, p\} = \prod_{i=1}^{M} \prod_{t=1}^{n_i} \prod_{k=1}^{K} \left\{ \pi_{ik} \frac{p_{kc(i,t)}}{\sum_{s=t}^{N} p_{kc(i,s)}} \right\}^{z_{itk}}. \tag{21.3}$$


Employing the latent class representation of the mixed membership model not only allows es- 
timation of the characteristic parameters of each profile but also direct estimation of the mixed 
membership parameter for each voter, thus achieving a soft clustering of the voters. In addition, 
the mixed membership of each individual can be further probed to establish which profile is most
appropriate for modeling voter $i$ when they are making choice level $t$.
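
One way to read the latent class representation is as a recipe for generating a ballot: at each preference level a profile is drawn from the voter's membership vector, and the next candidate is then chosen among those not yet ranked with probability proportional to that profile's support parameters. The sketch below follows this reading; the function name and all numerical values are hypothetical.

```python
import random


def simulate_vote(membership, supports, n_preferences, rng):
    """Generate one ballot from the latent class representation.

    At each level t a profile index is drawn with probabilities pi_i (the latent
    z_it), then a candidate is drawn from those still available with probability
    proportional to that profile's support parameters.
    """
    available = list(range(len(supports[0])))
    ballot, profiles_used = [], []
    for _ in range(n_preferences):
        k = rng.choices(range(len(membership)), weights=membership)[0]
        weights = [supports[k][c] for c in available]
        choice = rng.choices(available, weights=weights)[0]
        available.remove(choice)  # a candidate can appear only once on a ballot
        ballot.append(choice)
        profiles_used.append(k)
    return ballot, profiles_used


rng = random.Random(1)
supports = [[0.6, 0.3, 0.05, 0.05], [0.05, 0.05, 0.3, 0.6]]  # two illustrative profiles
ballot, profiles_used = simulate_vote([0.8, 0.2], supports, n_preferences=3, rng=rng)
print(ballot, profiles_used)
```

The second returned list records which profile generated each preference, which is the information carried by the latent variables in (21.3).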

A Bayesian approach is taken when estimating the mixed membership model for rank data and thus
the specification of prior distributions for the parameters of the model is required. It is assumed that
the mixed membership parameters follow a Dirichlet($\alpha$) distribution and that the support parameters
follow a Dirichlet($\beta$) distribution, i.e.,

$$\pi_i \sim \text{Dirichlet}\{\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)\}$$

$$p_k \sim \text{Dirichlet}\{\beta = (\beta_1, \beta_2, \ldots, \beta_N)\}.$$

The conjugacy of the Dirichlet distribution with the multinomial distribution means the use of 
a Dirichlet prior is naturally attractive. The use of a Dirichlet prior does, however, induce a negative
correlation structure between parameters. The sensitivity of inferences drawn under the mixed
membership model for rank data to the choice of prior model and hyperparameters is considered in
Gormley and Murphy (2009). For even moderately sized datasets it was found that the posterior
inferences were not heavily influenced by the prior specification. In practice, the prior parameters are fixed as
$\alpha = (0.5, \ldots, 0.5)$ and $\beta = (0.5, \ldots, 0.5)$, which is the Jeffreys prior for the multinomial
distribution (e.g., O’Hagan and Forster, 2004). These priors have positive mass near the corners of the
parameter simplex and thus the posterior distributions of the parameters can have high probability 
in these regions. However, the choice of parameters also avoids the posterior concentrating exactly 
on the corners of the simplex. 

In principle, the prior hyperparameters could be estimated as part of the inference procedure 
rather than fixed as done here, but this greatly increases the computational burden of model fitting 
and inference. 

Given these prior distributions and the augmented data likelihood function (21.3) from the mixed
membership model for rank data, the posterior distribution based on the data is:

$$
P\{\pi, p, z \mid x\} \propto
\left[ \prod_{i=1}^{M} \prod_{t=1}^{n_i} \prod_{k=1}^{K}
\left\{ \frac{\pi_{ik}\, p_{kc(i,t)}}{\sum_{s=t}^{N} p_{kc(i,s)}} \right\}^{z_{itk}} \right]
\left[ \prod_{i=1}^{M} \prod_{k=1}^{K} \pi_{ik}^{\alpha_k - 1} \right]
\left[ \prod_{k=1}^{K} \prod_{j=1}^{N} p_{kj}^{\beta_j - 1} \right].
$$

This posterior distribution differs from the posterior distribution in the case of the original mixed
membership model (Erosheva, 2002; 2003) in the form of the likelihood function. In the original
mixed membership model, discrete response variables are treated as independent given the mixed
membership parameters. The likelihood function is therefore the product of independent Bernoulli
distributions. In the mixed membership model for rank data, however, the dependence of choices
within a rank response leads to a more complex likelihood function that is the product of terms that
share parameter values.
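The way the rank likelihood shares parameter values across preference levels can be illustrated with a small sketch. It assumes that each choice probability is the Plackett-Luce ratio appearing in the posterior (the support of the chosen candidate divided by the total support of the candidates not yet chosen) and that summing out the latent indicator at each level gives a mixture over profiles. The function names and the toy support values are hypothetical.

```python
def choice_prob(p_k, ballot, t):
    """Probability that profile k selects ballot[t] at preference level t (0-based),
    given that the candidates ranked before level t are no longer available."""
    already_chosen = set(ballot[:t])
    remaining = sum(x for j, x in enumerate(p_k) if j not in already_chosen)
    return p_k[ballot[t]] / remaining


def ballot_prob(pi_i, p, ballot):
    """Probability of a ranked ballot for a voter with membership vector pi_i.
    At each preference level the choice is a pi_i-weighted mixture over profiles."""
    prob = 1.0
    for t in range(len(ballot)):
        prob *= sum(pi_k * choice_prob(p_k, ballot, t) for pi_k, p_k in zip(pi_i, p))
    return prob


# Toy example: two profiles over three candidates (illustrative values only).
p = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
pure_member = ballot_prob([1.0, 0.0], p, [0, 1, 2])   # 0.5 * (0.3 / 0.5) * (0.2 / 0.2)
mixed_member = ballot_prob([0.4, 0.6], p, [0, 1, 2])
```

Because every level's denominator depends on the same support vector, the terms of the product cannot be treated as independent Bernoulli factors, which is the point made in the text.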


21.4 Model Inference 

21.4.1 Parameter Estimation 

The mixed membership model for rank data can be efficiently fitted in a Bayesian framework. Due 
to the structure of the posterior distribution, Markov chain Monte Carlo (MCMC) methods are 
necessary to produce posterior samples of the model parameters. In particular, a Gibbs sampling
step can be used in the algorithm if the full conditional distribution for a model parameter has a
tractable form. For most of the model parameters in the mixed membership model for rank data this 
is indeed the case; however, in the case of the support parameters p, it is not. 

The full conditional distributions of the latent variables $z_{it}$ and the mixed membership
parameters $\pi_i$ are readily available. In particular,

$$
z_{it} \sim \text{Multinomial}\left( 1,\;
\frac{\pi_{i1}\phi_{1it}}{\sum_{k'=1}^{K} \pi_{ik'}\phi_{k'it}},\;
\frac{\pi_{i2}\phi_{2it}}{\sum_{k'=1}^{K} \pi_{ik'}\phi_{k'it}},\; \ldots,\;
\frac{\pi_{iK}\phi_{Kit}}{\sum_{k'=1}^{K} \pi_{ik'}\phi_{k'it}} \right)
$$

where $\phi_{kit}$ is defined as in (21.1) for $k = 1, 2, \ldots, K$, $i = 1, \ldots, M$, $t = 1, \ldots, n_i$; and

$$
\pi_i \sim \text{Dirichlet}\left( \alpha_1 + \sum_{t=1}^{n_i} z_{it1},\; \ldots,\; \alpha_K + \sum_{t=1}^{n_i} z_{itK} \right)
$$

for $i = 1, \ldots, M$.
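The two Gibbs updates just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the choice probability in (21.1) is the Plackett-Luce ratio used in the posterior, stores each latent indicator as a profile index rather than an indicator vector, and draws Dirichlet variates by normalizing independent gamma variates. All names are hypothetical.

```python
import random


def sample_z(pi_i, p, ballot, t, rng=random):
    """Draw the profile index for voter i at preference level t (0-based).
    Unnormalized weights are pi_ik * phi_kit; random.choices normalizes them."""
    already_chosen = set(ballot[:t])
    weights = []
    for pi_k, p_k in zip(pi_i, p):
        remaining = sum(x for j, x in enumerate(p_k) if j not in already_chosen)
        weights.append(pi_k * p_k[ballot[t]] / remaining)
    return rng.choices(range(len(pi_i)), weights=weights)[0]


def sample_pi(alpha, z_i, rng=random):
    """Draw pi_i from Dirichlet(alpha_k + number of levels assigned to profile k)."""
    counts = [z_i.count(k) for k in range(len(alpha))]
    gammas = [rng.gammavariate(a + c, 1.0) for a, c in zip(alpha, counts)]
    total = sum(gammas)
    return [g / total for g in gammas]


# Toy usage with illustrative values: one voter, two profiles, three candidates.
rng = random.Random(0)
p = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
ballot = [2, 0, 1]
pi_i = [0.5, 0.5]
z_i = [sample_z(pi_i, p, ballot, t, rng) for t in range(len(ballot))]
pi_i = sample_pi([0.5, 0.5], z_i, rng)
```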


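In the case of the support parameters, the full conditional distributions are

$$
P\{p_k \mid \pi, x, z\} \propto
\left[ \prod_{i=1}^{M} \prod_{t=1}^{n_i}
\left\{ \frac{\pi_{ik}\, p_{kc(i,t)}}{\sum_{s=t}^{N} p_{kc(i,s)}} \right\}^{z_{itk}} \right]
\left[ \prod_{j=1}^{N} p_{kj}^{\beta_j - 1} \right]. \qquad (21.4)
$$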


Due to the form of the likelihood function based on the rank data, the complete conditional distri- 
bution of the support parameters is not readily available for sampling and a Gibbs sampling step 
cannot be implemented. However, a Metropolis step can be used to sample the support parameters. 
Thus, a Metropolis-within-Gibbs sampler (Carlin and Louis, 2000) can be used to sample from the
posterior for all model parameters. 

In any Metropolis-based algorithm, the rate of convergence of the chain depends on the relation- 
ship between the proposal and target distributions. The use of a proposal distribution which closely 
mimics the shape and orientation of the target distribution provides an improved rate of convergence 
and good mixing. 

We start to construct a proposal distribution by examining the logarithm of the full conditional
of the support parameter $p_k$ (21.4), which is of the form

$$
\log P\{p_k \mid \pi, x, z\} = \sum_{i=1}^{M} \sum_{t=1}^{n_i} z_{itk}
\left\{ \log p_{kc(i,t)} - \log \sum_{s=t}^{N} p_{kc(i,s)} \right\}
+ \sum_{j=1}^{N} (\beta_j - 1) \log p_{kj} + \text{constant},
$$

where the constant does not depend on $p_k$. The function $-\log(\cdot)$ is a convex function and thus
the term $-\log \sum_{s=t}^{N} p_{kc(i,s)}$ can be approximated (in fact lower-bounded) by a hyperplane
that is tangent to the function at the currently sampled value of $p_k$. The resulting function is the
log of a gamma density and this can, in turn, be replaced by the log of a Gaussian density because the
shape parameter is typically quite large. Thus, the proposal distribution for $p_{kj}$ emerges as a
Gaussian density with mean and variance dependent on the previously sampled values of the model
parameters. As the Gaussian distribution extends beyond the [0, 1] interval in which the support
parameters lie, proposed values from this surrogate proposal must be suitably normalized.
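A Metropolis-type update for one support vector can be sketched as follows. Note that this sketch does not implement the Gaussian surrogate proposal described above; as a simpler stand-in it uses a Dirichlet proposal centred at the current value, with the Hastings correction for the asymmetry of that proposal. The target is the log form of (21.4). The concentration value `conc` and all function names are assumptions made for illustration.

```python
import math
import random


def log_target(p_k, ballots, z, k, beta):
    """Log of (21.4) up to an additive constant. ballots[i] lists candidate indices in
    preference order; z[i][t] is the profile index assigned to voter i at level t."""
    total = sum((b - 1.0) * math.log(x) for b, x in zip(beta, p_k))
    for ballot, z_i in zip(ballots, z):
        chosen = set()
        for t, c in enumerate(ballot):
            if z_i[t] == k:
                remaining = sum(x for j, x in enumerate(p_k) if j not in chosen)
                total += math.log(p_k[c]) - math.log(remaining)
            chosen.add(c)
    return total


def log_dirichlet_pdf(x, a):
    """Log density of a Dirichlet(a) distribution evaluated at x."""
    return (math.lgamma(sum(a)) - sum(math.lgamma(ai) for ai in a)
            + sum((ai - 1.0) * math.log(xi) for ai, xi in zip(a, x)))


def metropolis_step(p_k, ballots, z, k, beta, conc=200.0, rng=random):
    """One Metropolis-Hastings update of p_k. Returns (new value, accepted flag)."""
    a_forward = [conc * x for x in p_k]
    gammas = [rng.gammavariate(a, 1.0) for a in a_forward]
    total = sum(gammas)
    proposal = [g / total for g in gammas]
    if min(proposal) <= 0.0:          # numerical underflow: treat as a rejection
        return p_k, False
    a_reverse = [conc * x for x in proposal]
    log_ratio = (log_target(proposal, ballots, z, k, beta)
                 - log_target(p_k, ballots, z, k, beta)
                 + log_dirichlet_pdf(p_k, a_reverse)
                 - log_dirichlet_pdf(proposal, a_forward))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return p_k, False
```

A larger `conc` gives proposals closer to the current value and hence a higher acceptance rate but slower exploration; the surrogate proposal in the text is designed to avoid this kind of hand tuning.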

When estimating parameters via MCMC algorithms, some special features of the mixed membership
model for ranking data require attention. A fundamental issue in the fitting of any mixture-based
model within a Bayesian framework is that of label switching. This arises because of the
invariance of the posterior distribution to permutations in the labeling of the profiles. The methods
proposed for dealing with label switching, including those of Stephens (2000), Celeux et al. (2000),
and Jasra et al. (2005), need to be considered to avoid this issue. The online relabeling algorithm of
Stephens (2000) was found to be an effective method for handling this issue; this algorithm implements
relabeling as the MCMC algorithm progresses rather than as a post-processing step.

Full details of the Metropolis-within-Gibbs algorithm for fitting this model are given in Gormley
and Murphy (2009).

Another feature of the mixed membership model is the need to infer the model dimensionality, i.e.,
the number of voting profiles (K) needed to appropriately model the electorate. Within the Bayesian
paradigm, the natural approach would be to base inference on the posterior distribution of K given
the data x, $P\{K \mid x\}$. However, this posterior can be highly dependent on the model definition and
is typically computationally challenging to construct. A comprehensive overview and comparison 
of model selection criteria within the context of mixed membership models is provided in Joutard 
et al. (2008). 

In this application of the mixed membership model for rank data, the deviance information
criterion (DIC), introduced by Spiegelhalter et al. (2002), is used to choose an appropriate model. The
DIC penalizes the posterior mean deviance of a model by the “effective number of parameters.”
The effective number of parameters is derived to be the difference between the posterior mean
of the deviance and the deviance at the posterior means of the parameters of interest. Explicitly, for
data $x$ and parameters $\theta$ the DIC is

$$
\text{DIC} = \overline{D(\theta)} + p_D,
$$

where $D(\theta) = -2 \log[P(x \mid \theta)] + 2 \log[h(x)]$ is the Bayesian deviance and $h(x)$ is a function of
the data only. The effective number of parameters is defined as $p_D = \overline{D(\theta)} - D(\bar{\theta})$. The criterion
has an approximate decision theoretic justification. In any case, models with small DIC values are
preferable to models with large DIC values. The choice of $\theta$ in the calculation of DIC is important,
and we use $\theta = (\pi, p)$ because these are the primary model parameters of interest.
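Given MCMC output, the DIC calculation above reduces to simple arithmetic on deviance values. The sketch below assumes the deviance has already been evaluated at every retained posterior draw and once more at the posterior mean of the parameters; the function name and the toy numbers are hypothetical.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D) from deviances evaluated at each posterior draw and the
    deviance evaluated at the posterior mean of the parameters."""
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    p_d = mean_deviance - deviance_at_posterior_mean   # effective number of parameters
    return mean_deviance + p_d, p_d


# Toy numbers: posterior mean deviance 12, deviance at the posterior mean 11.
dic_value, p_d = dic([10.0, 12.0, 14.0], 11.0)
```

Here the posterior mean deviance is 12 and the effective number of parameters is 1, so the DIC is 13.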


21.5 Application to the 2002 Irish General Election 

The mixed membership model for rank data was applied to the voting data from the Dublin North 
constituency in the 2002 Irish general election. This study aims to establish the existence of different 
voting profiles in the electorate and to establish how voters align themselves with these profiles. This 
investigation will thus provide an enhanced insight into the actual voting behaviors exhibited in this 
electorate. 

The Metropolis-within-Gibbs sampler, as outlined in Section 21.4.1, was run over 50,000 itera- 
tions with a burn-in period of 10,000 iterations. The model was fitted with K = 1, 2, . . . , 7 voting 
profiles in order to establish the appropriate number of profiles to adequately model the data. 

For each value of K, the DIC value was computed (shown in Figure 21.1). The plot shows
a sharply decreasing trend when K increases from 1 to 3, and the DIC values decrease slightly 
thereafter. Consequently, the fitted models for K > 3 were examined and it was determined that the 
K = 3 model was most appropriate because the models with K > 3 included extra extreme profiles 
that didn’t differ greatly from those in the model with K = 3. 

21.5.1 Support for the Candidates 

The marginal posterior densities of the support parameters for each candidate within the three voting
profiles are illustrated in Figure 21.2; a violin plot (Hintze and Nelson, 1998; Adler, 2005) is used
to show these marginal posterior densities. The violin plot combines a boxplot and a kernel density 
estimate; the length of the violin corresponds to the length of the box in a boxplot but the breadth 
of the violin shows a back-to-back plot of a kernel density estimate of the values. The marginal 
probabilities for the voting profiles are (0.323, 0.324, 0.353), respectively. 

FIGURE 21.1

Values of the DIC for the mixed membership model for rank data fitted to the 2002 Dublin North
constituency data over different values of the number of voting profiles K.

The three voting profiles have distinct and intuitive interpretations within the context of the 2002
Irish general election. The four elected candidates have high support in at least one of the voting
profiles and some other prominent candidates also have high support.

Voting Profile 1: Non-mainstream opposition and protest voters. 

Figure 21.2(a). The posterior mean support parameter estimates for the candidates in this 
voting profile suggest that a pure member of this voting profile should strongly support the 
non-mainstream opposition parties and single issue/protest candidates. Clare Daly (Socialist 
Party) has the largest support, and she would be characterized as a major candidate in the non- 
mainstream opposition in Ireland. Despite having such high support in this voting profile she 
failed to get elected. Trevor Sargent (Green Party) was leader of the Green Party at the time of 
the election and the 2002 election saw the party increase its number of seats in the Dail from 
two to six seats thus moving them towards the mainstream opposition. Sean Ryan was a Labour 
party candidate; the Labour party has a diverse range of support within the Irish electorate so 
it could be considered to be a mainstream party, but it would also have to appeal to voters who 
don’t support other mainstream parties. Interestingly, candidates that received very few first 
preference votes (e.g., Eamon Quinn and David Walshe) have appreciable support in this voting 
profile. The non-election of Clare Daly, despite having high support, can be explained by the
fact that Trevor Sargent and Sean Ryan were only elected in the later counts (see Table 21.1),
so Clare Daly didn’t have the opportunity to receive transfers from voters who gave the other
candidates higher preferences than her. 

Voting Profile 2: Mainstream opposition voters. 

Figure 21.2(b). The support parameters for Trevor Sargent (Green Party), Sean Ryan (Labour),
Nora Owen (Fine Gael), and Cathal Boland (Fine Gael) are all large relative to the other
candidates. Fine Gael was the largest opposition party before the election and their support here
suggests that this voting profile shows support for the mainstream opposition parties. Labour
was the second largest opposition party and traditionally Labour and Fine Gael have formed
coalition governments, so they share much support amongst the voters. The 2002 election saw
the Green party move towards becoming a mainstream opposition party; this is reflected in this
voting profile too. Prior to the election, there was some discussion in the print media about Fine
Gael, Labour, and the Green party forming a coalition government if they gained enough seats,
but this did not happen.

FIGURE 21.2

Violin plots of the posterior samples for the support parameters. The plot shows the marginal
posterior density for each support parameter, for each of the twelve candidates and the three voting
profiles. The abbreviation used for each candidate’s name is given in Table 21.1. The elected
candidates are marked with an asterisk. Panel (c) shows Voting Profile 3: Fianna Fail.

Voting Profile 3: Fianna Fail voters. 

Figure 21.2(c). The posterior mean support parameter estimates for the candidates in this profile 
reveal that only those voters with a high degree of profile membership should give strong support 
to the three Fianna Fail candidates. All other candidates have very low support. 

The division of the voters into three profiles provides a systematic method for decomposing the 
electorate into a small number of profiles. The relevance of the revealed profiles is supported by the 
exploratory analysis of these data in Laver (2004). Interestingly, the division of candidates amongst 
the profiles corresponds very closely to the hierarchical decomposition of the candidates and parties 
in Dublin North as found in Huang (2011) and Huang and Guestrin (2012). 


21.5.2 Mixed Membership Parameters for the Electorate 

The unique feature of the mixed membership model is that the partial memberships of the voting 
profiles for each voter are inferred directly when estimating the model. The entropy (Shannon, 
1948) of each voter’s mixed membership vector measures the degree to which they exhibit mixed 
membership across voting profiles. In fact, the exponential of the entropy can be seen as the effective 
number of profiles (Campbell, 1966; White et al., 2012) which are required to model voter i’s 
preferences. Figure 21.3 shows a histogram of the exponentiated entropy values for the Dublin 
North voters. These show that there is significant evidence of mixed membership for the voters with 
many being effectively members of two or more of the profiles. 
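The effective number of profiles for a voter is straightforward to compute from a membership vector. The sketch below applies the exponentiated Shannon entropy to the two membership vectors reported for the Dublin North voters with the lowest and highest values; the function name is hypothetical.

```python
import math


def effective_profiles(pi_i):
    """Exponential of the Shannon entropy of a membership vector.
    Zero entries contribute nothing, following the convention 0 * log(0) = 0."""
    entropy = -sum(x * math.log(x) for x in pi_i if x > 0)
    return math.exp(entropy)


lowest = effective_profiles([0.068, 0.885, 0.047])    # close to a pure member of Profile 2
highest = effective_profiles([0.333, 0.336, 0.331])   # almost evenly spread over all three
```

The first vector gives a value of roughly 1.5 and the second a value just below 3, so the measure ranges from 1 for a pure member of a single profile to K for a voter spread evenly over all K profiles.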



FIGURE 21.3

A histogram of the exponential of the entropy values for each voter’s mixed membership parameter
(horizontal axis: Exp(Entropy)). The values shown give an “effective number of profiles” needed to
model each voter.

The voter with the lowest effective number of profiles has a membership vector $\pi_i$ =
(0.068, 0.885, 0.047) and they recorded the vote $x_i$ = (Boland, Owen, Sargent, Ryan, Goulding,
Quinn, Walshe, Daly, Glennon, Wright, Kennedy, Davis). Since their highest preference choices all
have high support in Voting Profile 2, it is clear why they have particularly high membership to this
profile and low membership to other profiles. The voter with the highest effective number of profiles
has a membership vector $\pi_i$ = (0.333, 0.336, 0.331) and they recorded the vote $x_i$ = (Goulding,
Daly, Ryan, Boland, Owen, Glennon, Wright, Kennedy). In this case, the voter’s highest preference
votes have high support in different profiles, so the mixed membership model suggests that all three
profiles are needed to model their preferences.

We can further explore the mixed membership vectors by dividing the voters into groups, as- 
signing each voter to the voting profile for which they have the highest membership score (i.e., their 
modal profile membership). We construct a kernel density estimate of the mixed membership pa- 
rameter for each voting profile for each of the groups of voters (Figure 21.4). Clearly, a significant 
proportion of the voters who have the strongest affiliation to Voting Profiles 1 and 2 also have a 
strong affiliation to at least one other profile. In contrast, voters who have strongest affiliation to 
Voting Profile 3 tend to have very little affiliation to the other voting profiles. This suggests that 
Voting Profiles 1 and 2 are closer, thus voters exhibit more mixed membership between these two 
profiles. This makes intuitive sense within the context of the 2002 Irish general election as Voting 
Profile 3 represents the current government party, with Profiles 1 and 2 representing two different 
types of opposition.

FIGURE 21.4

Kernel density estimates of the membership parameters for those voters most likely to be
characterized by each profile. Panels: (a) Modal members of Profile 1; (b) Modal members of Profile 2;
(c) Modal members of Profile 3.

Posterior predictive simulation (Gilks et al., 1996) was employed to assess model fit. Subsequent
to a burn-in period of 10,000 iterations, 40,000 samples thinned every 100th iteration were drawn
from the posterior distribution $P\{\pi, p, z \mid x\}$, giving R = 400 sets of parameters simulated from the
posterior. A predictive election dataset $x^r$ was then simulated from the mixed membership model for
the rank data, given each of the $r = 1, \ldots, R$ draws of the parameters from the posterior distribution.
Due to the discrete and structured nature of the data, it is difficult to fully assess model fit, so first
order summaries were used. For the simulated votes, the number of first preference votes obtained by
the twelve candidates was recorded. Figure 21.5 illustrates the number of first preferences received
by each candidate in each simulated posterior predictive dataset, and in the Dublin North voting
data.
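The posterior predictive check can be sketched as a simulation from the model for one posterior draw. The code below assumes the generative reading used for the posterior: at each preference level a voter draws a profile from their membership vector and then chooses among the still-unranked candidates in proportion to that profile's support. Ballot lengths are taken from the observed data, every ballot is assumed to rank at least one candidate, and all support values are assumed strictly positive. Names and toy values are hypothetical.

```python
import random


def simulate_ballot(pi_i, p, length, rng=random):
    """Simulate one ranked ballot of the given length for a voter with membership pi_i."""
    available = list(range(len(p[0])))
    ballot = []
    for _ in range(length):
        k = rng.choices(range(len(pi_i)), weights=pi_i)[0]            # profile for this level
        c = rng.choices(available, weights=[p[k][j] for j in available])[0]
        ballot.append(c)
        available.remove(c)
    return ballot


def first_preference_counts(pi, p, lengths, rng=random):
    """Number of simulated first preferences per candidate for one posterior draw (pi, p)."""
    counts = [0] * len(p[0])
    for pi_i, n_i in zip(pi, lengths):
        counts[simulate_ballot(pi_i, p, n_i, rng)[0]] += 1
    return counts


# Toy posterior draw: 100 voters, two profiles, three candidates.
rng = random.Random(42)
p = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
pi = [[0.8, 0.2]] * 50 + [[0.1, 0.9]] * 50
counts = first_preference_counts(pi, p, [3] * 100, rng)
```

Repeating `first_preference_counts` for each of the R retained draws gives the cloud of simulated counts against which the observed first preference counts are compared.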

The posited model appears to capture the main structure of the data, but there is some discrep- 
ancy between the observed and the simulated values. The discrepancy can be explained by the fact 
that the support parameters p are used to model the probability of candidate selection at all pref- 
erence levels and thus the posterior estimates for these parameters depend on all preference levels 
rather than just first preferences. This may lead to a slight under- or overestimation of the number
of first preference selections for a candidate.

FIGURE 21.5

This plot shows the posterior predictive counts for each candidate in the Dublin North constituency. 
Each circle indicates the number of first preference votes received by the twelve candidates in each 
of 400 simulated posterior predictive datasets. The crosses indicate the number of first preferences 
received by each candidate in the actual voting data. 


21.6 Conclusion 

A mixed membership model for rank data has been described and applied to the analysis of a large 
election dataset. It has been shown that in the context of analyzing rank response data, the model 
provides scope to examine a population for the presence of preference profiles, to estimate the 
characteristics of these profiles, and to investigate the mixed membership of population members to
the profiles on a case-by-case basis. The loss of information which may result from a hard clustering
of the data is avoided by providing a soft clustering of the population. In particular, a hard clustering 
forces each voter to belong to one and only one cluster, so even if they are best characterized by 
a single cluster, any unusual aspects of their voting preferences are lost in the hard clustering. In 
contrast, the mixed membership model provides a parsimonious description of voting preferences 
because complex preference patterns can be captured using the mixed membership machinery. 

The method provides an alternative modeling framework to the many mixture modeling ap- 
proaches for rank data (Stern, 1993; Murphy and Martin, 2003; Gormley and Murphy, 2006; Busse 
et al., 2007; Gormley and Murphy, 2008a; Meila and Chen, 2010). In particular, Gormley and Mur- 
phy (2008a) developed a finite mixture of Plackett-Luce models for modeling PR-STV data which 
provides a modeling framework. However, when studying large voting datasets with diverse candi- 
dates, a large number of mixture components are needed to appropriately model the data. In contrast, 
the mixed membership model can represent voting in such elections with many fewer profiles. 

The model described herein can be fitted in a Bayesian paradigm using an efficient Markov 
chain Monte Carlo scheme. The method is able to explore the posterior efficiently because the pro- 
posal distributions developed for sampling the support parameters, which don’t have a closed-form 
conditional posterior, are accurate approximations of the parameter conditional posterior distribu- 
tions. Recently, Caron and Doucet (2012) developed a Gibbs sampling method for the Plackett-Luce 
model and this could be adapted to fit the mixed membership model outlined herein, thus improving 
the accuracy of model inference. An alternative method for fitting such models would be to use vari- 
ational Bayesian (VB) methods or expectation propagation (EP); Weng and Lin (2011) developed 
an online VB algorithm and Guiver and Snelson (2009) developed an EP algorithm for a single 
Plackett-Luce model; there is potential to extend these methods to the mixed membership model 
herein. 

The mixed membership model for rank data could be developed in several directions. In terms of
the application in this chapter, further model accuracy could be attained by imposing a hierarchical
framework: a hyperprior could be introduced for the Dirichlet parameters $\alpha$ and $\beta$ of the mixed
membership and support parameter priors, respectively; such hierarchical priors are employed in
Pritchard et al. (2000) and Erosheva (2003).

The issue of model choice for mixed membership models is still problematic (Joutard et al., 
2008). The combination of the use of DIC (Spiegelhalter et al., 2002) and posterior predictive model 
checks (Gilks et al., 1996) provided a suitable method in this application, but there were different 
numbers of extreme profiles (K) that achieved similar fit. Thus, there remains the need for more 
automatic model choice methods. 

Recently, a number of models have been developed that capture underlying group structure for 
rank data when concomitant information for the voters is also available (Gormley and Murphy, 
2008b; Francis et al., 2010; Lee and Yu, 2010; 2012; Li et al., 2012). It would be worthwhile 
to extend the mixed membership modeling framework for rank data to include such concomitant 
information. Such a modeling extension would help explain the structure revealed by the mixed 
membership model for ranked data. 


Appendix: Data Sources

The 2002 Dublin North constituency voting data were made available by the Dublin County
Returning Officer. The data are available from the authors on request.

Quantitative methods for analyzing social networks have primarily focused on either
single-network statistical models (e.g., Airoldi et al., 2008; Hoff et al., 2002; Wasserman and Pattison, 1996)
or summarizing multiple networks with descriptive statistics (e.g., Penuel et al., 2013; Moolenaar 
et al., 2010; Frank et al., 2004). Many experimental interventions and observational studies, however,
involve several if not many networks.

To model such samples of independent networks, we use the Hierarchical Network Models
framework (Sweet et al., 2013, HNM) to introduce hierarchical mixed membership stochastic
blockmodels (HMMSBM), which extend single-network mixed membership stochastic blockmodels
(Airoldi et al., 2008, MMSBM) for use with multiple networks and network-level experimental
data. We also show how covariates can be incorporated into these models.

The HMMSBM is quite flexible in that it can be used on both intervention and observational
data. Models can be specified to estimate a variety of treatment effects related to subgroup
membership as well as covariates and additional hierarchical parameters. Using simulated data, we present
several empirical examples involving network ensemble data to illustrate model fit feasibility and
parameter recovery.


22.1 Introduction 

A social network represents the relationships among a group of individuals or entities and is com- 
monly illustrated by a graph. The nodes or vertices represent individuals or actors and the edges 
between them the ties or relationships between two individuals. These edges may be directed, sug- 
gesting a sender and receiver of the interaction, or undirected, suggesting reciprocity between the 
two nodes; a network depicting collaboration is likely to be an undirected graph whereas a net- 
work depicting advice-seeking would be a directed graph. Figure 22.1 shows advice-seeking ties 
among teachers regarding two different subjects. Network ties are part of a larger class of obser- 
vations termed relational data, since these data reflect pairwise relationships, such as the presence 
and direction of pairwise ties. Since relationships are pervasive, it is unsurprising that relational 
data methodology has applications in a wide variety of fields, including biology (Airoldi et al.,
2005), international relations (Hoff and Ward, 2004), education (Weinbaum et al., 2008), sociology
(Goodreau et al., 2009), and organizational theory (Krackhardt and Handcock, 2007).



FIGURE 22.1 

Two social networks, depicting asymmetric advice-seeking behavior among two groups of teachers, 
from Pitts and Spillane (2009). Vertices, or nodes, represent individual teachers. Arrows, or directed 
edges, point from advice seeking teachers to advice providing teachers. 


Two prominent quantitative methods for analyzing social networks are descriptive network 
statistics and statistical modeling. Descriptive network statistics are useful for exploring, summa- 
rizing, and identifying certain features of networks, which are then used as covariates in other sta- 
tistical models. Common statistics include density, the total number of ties; degree, the number of 
ties for any one node; betweenness, the extent that a node connects other nodes; and other observed 
structural elements such as triangles. Kolaczyk (2009) provides a comprehensive list. Descriptive 
statistics are inherently aggregate, so using them to represent a network or to compare networks is
problematic. For example, Figure 22.1 shows two networks with similar density (22 ties among 27
nodes and 19 ties among 28 nodes); however, the structure of the networks is quite different.
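The descriptive statistics mentioned above are easy to compute from a binary adjacency matrix, which also makes their aggregate nature concrete. The sketch below assumes a directed network without self-ties and reports the total number of ties, the proportion of possible ties that are present, and the out-degree and in-degree of each node; the function name and the toy matrix are hypothetical.

```python
def network_summaries(Y):
    """Return (number of ties, proportion of possible ties, out-degrees, in-degrees)
    for a directed binary adjacency matrix Y given as a list of rows."""
    n = len(Y)
    out_degree = [sum(Y[i][j] for j in range(n) if j != i) for i in range(n)]
    in_degree = [sum(Y[i][j] for i in range(n) if i != j) for j in range(n)]
    ties = sum(out_degree)
    return ties, ties / (n * (n - 1)), out_degree, in_degree


# Toy directed network on three nodes.
Y = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]
ties, proportion, out_degree, in_degree = network_summaries(Y)
```

On this scale the two networks in Figure 22.1 have 22/(27 × 26) ≈ 0.031 and 19/(28 × 27) ≈ 0.025 of their possible directed ties present, which says nothing about how differently those ties are arranged.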

Alternatively, a statistical social network model formalizes the probability of observing the entire
network and its various structural features. Current methods generally fall into one of three
categories: exponential random graph models (ERGM), latent space models (LSM), and mixed
membership stochastic blockmodels (MMSBM); see Goldenberg et al. (2009) for a comprehensive
review. An exponential random graph model (Wasserman and Pattison, 1996) represents the
probability of observing a particular network as a function of network statistics. The latent space model
(Hoff et al., 2002) assumes each node occupies a position in a latent social space. The probability
of a tie between two individuals is modeled as a function of the pairwise distance in this space.
Stochastic blockmodels assign each node to one of a fixed, finite number of groups, and the probability
of a tie between two nodes is determined by the group membership of each node. The mixed mem- 
bership stochastic blockmodel (Airoldi et al., 2008) allows nodes to belong to multiple groups so 
that group membership may vary by node interaction. 

Most modeling methodology for social networks focuses on modeling a single network, but 
in many applications more than one network may be of interest. The study of multiple networks 
can be divided into three classes: studying multiple types of ties among nodes of one network 
(e.g., friendship ties and collaboration ties), studying one network over time, and studying a single 
measure on multiple isolated networks. There has been a fair amount of work done for the first two 
cases. Fienberg et al. (1985) showed how loglinear models can be used to model multiple measures 
on a single network and Pattison and Wasserman (1999) extended this work for the logit forms of p* 
models. Longitudinal methods to model a single network over time have been extensively studied. 
The three categories of models each have known longitudinal extensions: Hanneke et al. (2010) 
introduced temporal ERGMs which are based on a discrete Markov process; Westveld and Hoff 
(2011) embedded an auto-regressive structure in LSMs; and Xing et al. (2010) added a state-space 
model to the MMSBM. 

Modeling a sample of isolated networks has only recently attracted sustained attention. Motivated
by social networks of teachers in education research, Sweet et al. (2013) introduced hierarchical
network models (HNM), a class of models for modeling ensembles of networks. The purpose
of this paper is to use the HNM framework to formally introduce hierarchical mixed membership 
stochastic blockmodels (HMMSBM) which extend the MMSBM for use with relational data from 
multiple isolated networks. 

In the next section, we formally define the MMSBM for a single network, present a covariate 
version of an MMSBM, and introduce an MCMC algorithm for estimation. In Section 22.3, we
present the HNM framework and formally define the HMMSBM. Extending our MCMC algorithm 
for a single network, we present an algorithm for fitting the HMMSBM that we illustrate with two 
examples. We conduct a simple simulation study for sensitivity analysis and conclude with some 
remarks regarding estimation and utility of these models. 


22.2 Modeling a Single Network 

A single social network $Y$ among $n$ individuals can be represented by an adjacency matrix of
dimension $n \times n$,

$$
Y = \begin{bmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{n1} & Y_{n2} & \cdots & Y_{nn}
\end{bmatrix}, \qquad (22.1)
$$

where $Y_{ij}$ is the value of the tie from $i$ to $j$. These ties might be binary, indicating the presence or
absence of a tie, or an integer or real number, indicating the frequency of interaction or strength of
a tie. For the purposes of this paper, we restrict ourselves to binary ties.

In many contexts, individuals in the network belong to certain subgroups. In a school faculty 
network, for example, teachers belong to departments. However, these group memberships are often
not directly observed and can only be inferred through the network structure. Figure 22.2(a)
shows an adjacency matrix for a network generated from a stochastic blockmodel in which individuals
belong to one of four groups. A black square indicates the presence of a tie between two
individuals. Ties within groups are much more likely than ties across groups. Blockmodels are most 
appropriate for relational data with this structure and a variety of blockmodels have been studied 
(see Anderson and Wasserman, 1992). 

Stochastic blockmodels assign each individual membership to a block or group, and assignment 
may either be observed or latent. Tie probabilities are then determined through group membership; 
usually within-group tie probabilities are modeled to be much larger than between-group tie proba- 
bilities, resulting in the block structure shown in Figure 22.2. 



FIGURE 22.2

(a) A network generated from a stochastic blockmodel where group membership is not mixed. Each
node is assigned a group membership which determines the probability of ties. (b) A network generated
from an MMSBM. Node membership may vary with each pairwise interaction. Note that the $n \times n$
sociomatrix displays a black box for each tie and white otherwise.


Mixed membership stochastic blockmodels (MMSBM) instead allow block membership to be 
defined for each interaction with a new partner. Rather than assuming individual i is a member of 
block $k$ for all interactions, the block membership is determined anew for each interaction. Individual
$i$ might belong to block $k$ when interacting with individual $j$ but belong to block $k'$ when
interacting with individual $j'$.

We define the MMSBM as a hierarchical Bayesian model (Airoldi et al., 2008),

$$
\begin{aligned}
Y_{ij} &\sim \text{Bernoulli}(S_{ij}^{T} B R_{ji}) \\
S_{ij} &\sim \text{Multinomial}(1, \theta_i) \\
R_{ji} &\sim \text{Multinomial}(1, \theta_j) \\
\theta_i &\sim \text{Dirichlet}(\lambda) \\
B_{\ell m} &\sim \text{Beta}(a_{\ell m}, b_{\ell m}),
\end{aligned} \qquad (22.2)
$$

where the group membership probability vector for individual $i$ is $\theta_i$, and specific group
memberships are determined through a multinomial distribution. $S_{ij}$ is the group membership indicator
vector of $i$ when initiating interaction with $j$, and $R_{ji}$ is the group membership indicator vector of
$j$ when acting in response to $i$. Notice the stochastic nature of $S_{ij}$ and $R_{ji}$; each is sampled for
every interaction from $i$ to $j$, allowing individual group memberships to vary. The value of $Y_{ij}$ is
determined based on a block dependent probability matrix $B$, where $B_{\ell m}$ is the probability of a tie
from an individual in group $\ell$ to an individual in group $m$.

The hyperparameter $\lambda$ may be fixed and known or estimated as a parameter. The dimension
of $\lambda$ identifies the number of groups ($g$) and the value of $\lambda$ determines the shape of the Dirichlet
distribution on the $g$-simplex. The hyperparameters $(a_{\ell m}, b_{\ell m})$ are generally elicited so that
within-group tie probabilities are higher than across-group tie probabilities.

22.2.1 Single Network Model Estimation 

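We developed a Markov chain Monte Carlo (MCMC; Gelman et al., 2004) algorithm to fit the
MMSBM. The joint likelihood of the model can be written as the following product:

\[
\begin{aligned}
& P(Y \mid S, R, \theta, \lambda, B)\, P(S \mid \theta)\, P(R \mid \theta)\, P(\theta \mid \lambda)\, P(B)\, P(\lambda) \\
&\quad = \prod_{i,j} P(Y_{ij} \mid S_{ij}, R_{ji}, \theta_i, \theta_j, \lambda, B) \prod_{i,j} P(S_{ij} \mid \theta_i)\, P(R_{ji} \mid \theta_j) \prod_{i} P(\theta_i \mid \lambda) \prod_{\ell, m} P(B_{\ell m})\, P(\lambda).
\end{aligned}
\tag{22.3}
\]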

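The complete conditionals for $\theta$, $R$, $S$, and $B$ can be written in closed form, so we use Gibbs
updates for each. Full conditional posterior probability distributions are listed below. Let $\ldots$
represent all other parameters and data in the model, let $\ell^*$ represent the group indicated by $R_{ji}$,
and let $m^*$ represent the group indicated by $S_{ij}$.

\[
\begin{aligned}
P(\theta_i \mid \ldots) &\propto \mathrm{Dirichlet}\Big(\lambda + \sum_j S_{ij} + \sum_j R_{ij}\Big) \\
P(S_{ij} \mid \ldots) &\propto \mathrm{Multinomial}(p), \qquad p_k = \theta_{ik}\, B_{k\ell^*}^{Y_{ij}} (1 - B_{k\ell^*})^{1 - Y_{ij}} \\
P(R_{ji} \mid \ldots) &\propto \mathrm{Multinomial}(q), \qquad q_k = \theta_{jk}\, B_{m^*k}^{Y_{ij}} (1 - B_{m^*k})^{1 - Y_{ij}} \\
P(B_{\ell m} \mid \ldots) &\propto \mathrm{Beta}\Big(a_\ell + \sum_{(ij)^*} Y_{ij},\; b_m + \sum_{(ij)^*} (1 - Y_{ij})\Big),
\end{aligned}
\tag{22.4}
\]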


where $(ij)^*$ is an $(\ell, m)$-specific subset of $i = 1, \ldots, n$ and $j = 1, \ldots, n$ such that $S_{ij} = \ell$ and
$R_{ji} = m$. In addition, we incorporate a sparsity parameter $\rho$ (Airoldi et al., 2008). The absence of
ties can be attributed to either rarity of interaction across groups or lack of interest in making across-
group ties. For example, teachers in departments in schools may have few collaborative ties outside
of their department because they interact less often with teachers outside their department but also
because they would rather interact with those who teach the same subjects. The sparsity parameter
helps to account for sparsity in the adjacency matrix due to lack of interaction. The probability of
ties from group $\ell$ to $m$ is therefore modeled as $\rho B_{\ell m}$.
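Two of the Gibbs updates in (22.4) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `sender_probs` and `update_block_entry` are hypothetical, NumPy is assumed, the sparsity parameter is ignored, and the toy matrix `B` is not from the chapter.

```python
import numpy as np

def sender_probs(theta_i, B, l_star, y_ij):
    """Normalized full conditional for S_ij, given the receiver's group l_star."""
    b = B[:, l_star]                                   # B_{k, l*} for every sender group k
    p = theta_i * b**y_ij * (1.0 - b)**(1 - y_ij)      # unnormalized p_k from (22.4)
    return p / p.sum()

def update_block_entry(a, b, y_sub, rng):
    """Beta draw for one B_{lm}; y_sub holds Y_ij for pairs with S_ij = l and R_ji = m."""
    y_sub = np.asarray(y_sub)
    return rng.beta(a + y_sub.sum(), b + (1 - y_sub).sum())

# Toy values (hypothetical): two groups, equal membership probabilities for node i.
rng = np.random.default_rng(0)
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_tie = sender_probs(np.array([0.5, 0.5]), B, l_star=0, y_ij=1)
p_no_tie = sender_probs(np.array([0.5, 0.5]), B, l_star=0, y_ij=0)
new_entry = update_block_entry(3, 1, [1, 1, 0], rng)
print(p_tie, p_no_tie, new_entry)
```

With an observed tie to a receiver in group 0, the sender is most likely acting as a member of group 0; with no tie, the weight shifts to group 1. This is how the data pull the per-interaction memberships toward the block structure.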

If $\lambda$ is estimated, we use a common parameterization and let $\lambda = \gamma\xi$, where $\gamma = \sum_{k=1}^{g} \lambda_k$
and $\sum_{k=1}^{g} \xi_k = 1$ (Erosheva, 2003). We can think of $\gamma$ as a measure of how extreme the Dirichlet
distribution is, i.e., small values of $\gamma$ imply greater mass in the corners of the $g$-simplex. Since
$\xi$ sums to 1, it is an indirect measure of the probability of belonging to each group. Equal values
of $\xi$ suggest equal sized groups. As defined, $\gamma$ and $\xi$ are independent and we update each using
Metropolis steps.

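To update $\gamma$, we use a gamma proposal distribution with shape parameter $\nu_\gamma$ and rate parameter
selected so that the proposal distribution has a mean at the current value of $\gamma$. The value of $\nu_\gamma$ is
then tuned to ensure an appropriate acceptance rate. The proposed value $\gamma^{(s+1)}$ is accepted
with probability $\min\{1, R\}$, where

\[
R = \frac{P(\gamma^{(s+1)} \mid \ldots)\, P(\gamma^{(s)} \mid \gamma^{(s+1)})}{P(\gamma^{(s)} \mid \ldots)\, P(\gamma^{(s+1)} \mid \gamma^{(s)})}.
\]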
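A single step of this update might look like the sketch below. It is an illustration under assumptions, not the authors' implementation: the helper names are hypothetical, the proposal rate is taken to be the shape divided by the current value (which gives the stated mean), and the target `log_post` is a toy stand-in (the Gamma(1, 5) density) rather than the full conditional of the MMSBM.

```python
import math
import numpy as np

def log_gamma_pdf(x, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def mh_step_gamma(current, log_post, nu, rng):
    """One Metropolis-Hastings step with a gamma proposal whose mean is `current`."""
    proposal = rng.gamma(shape=nu, scale=current / nu)       # rate = nu / current
    log_R = (log_post(proposal) - log_post(current)
             + log_gamma_pdf(current, nu, nu / proposal)     # P(current | proposal)
             - log_gamma_pdf(proposal, nu, nu / current))    # P(proposal | current)
    if rng.uniform() < math.exp(min(0.0, log_R)):            # accept with prob min{1, R}
        return proposal, True
    return current, False

# Toy run (hypothetical target and tuning value).
rng = np.random.default_rng(1)
log_post = lambda g: log_gamma_pdf(g, 1.0, 5.0)
gam, accepted, chain = 0.2, 0, []
for _ in range(2000):
    gam, was_accepted = mh_step_gamma(gam, log_post, nu=5.0, rng=rng)
    accepted += was_accepted
    chain.append(gam)
accept_rate = accepted / 2000
print(round(accept_rate, 2))
```

Larger values of `nu` concentrate the proposal around the current value and raise the acceptance rate; smaller values give bolder moves that are rejected more often. The proposal-density terms in `log_R` are needed because the gamma proposal is not symmetric.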

To update $\xi$, we use a Dirichlet proposal distribution centered at the current value of
$\xi$. Thus, $\xi^{(s+1)} \sim \mathrm{Dirichlet}(\nu_\xi\, g\, \xi^{(s)})$, where $\nu_\xi$ is the appropriate tuning parameter. The proposed
value $\xi^{(s+1)}$ is accepted with probability $\min\{1, R\}$, where now

\[
R = \frac{P(\xi^{(s+1)} \mid \ldots)\, P(\xi^{(s)} \mid \xi^{(s+1)})}{P(\xi^{(s)} \mid \ldots)\, P(\xi^{(s+1)} \mid \xi^{(s)})}.
\]


22.2.2 Empirical Example 

To illustrate fitting a single network MMSBM, we use the Monk data of Sampson (1968). While
staying with a group of monks as a graduate student, Sampson recorded relational data among the
monks at different time periods during his year-long stay. Toward the end of his stay, there was a
political crisis which resulted in several monks being expelled and several others leaving.

We use relational data from three time periods prior to the crisis. For each time period, each monk
nominated the three monks he liked best. These data have been aggregated into a single
adjacency matrix, where $Y_{ij} = 1$ if monk $i$ nominated $j$ as one of his top three choices during any
of the three time periods. $Y_{ii}$ is undefined.

Based on past work suggesting three subgroups of Monks (Breiger et al., 1975), we fit the 
following MMSBM: 


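\[
\begin{aligned}
Y_{ij} &\sim \mathrm{Bernoulli}(S_{ij}^{T} B R_{ji}) \\
S_{ij} &\sim \mathrm{Multinomial}(1, \theta_i) \\
R_{ji} &\sim \mathrm{Multinomial}(1, \theta_j) \\
\theta_i &\sim \mathrm{Dirichlet}(\gamma\xi) \\
B_{\ell\ell} &\sim \mathrm{Beta}(3, 1) \\
B_{\ell m} &\sim \mathrm{Beta}(1, 10), \quad \ell \neq m \\
\gamma &\sim \mathrm{Gamma}(1, 5) \\
\xi &\sim \mathrm{Dirichlet}(1, 1, 1),
\end{aligned}
\tag{22.5}
\]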


where 1 — p = 1 — , and N = 18, the number of monks. 

We sample MCMC chains of length 15,000, keeping the last 10,000, and retaining 1 out of every
25 steps for a posterior approximation of 401 samples. To assess our fit, we compare the original
sociomatrix, shown in panel (a) of Figure 22.3, to our fitted model in panel (b). Using posterior means
for each parameter, we illustrate the probability of a tie between two monks by color, with low probabilities
in shades of blue and high probabilities in red, orange, and yellow (see the legend in panel (c) of Figure 22.3).
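The posterior sample size follows from the chain length, burn-in, and thinning interval. The short sketch below reproduces the count under one assumption that the text does not state explicitly: both endpoints of the retained window (iterations 5,000 and 15,000) are kept. The function name is hypothetical.

```python
def retained_iterations(n_iter, burn_in, thin):
    """Indices of the iterations kept after burn-in and thinning (endpoints included)."""
    return list(range(burn_in, n_iter + 1, thin))

kept = retained_iterations(n_iter=15_000, burn_in=5_000, thin=25)
print(len(kept))
```

The same arithmetic applied to a chain of length 30,000 with 5,000 burn-in steps and a thinning interval of 25 gives a sample of size 1001.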

22.2.3 Incorporating Covariates into a MMSBM 

While the MMSBM captures block structure, network ties may also form based on other individual
similarities that are unrelated to the existing block structure. Teachers in schools
may belong to departments, and they may also belong to groups based on unobserved characteristics.
But some ties might also form based on proximity in the school building, teaching the same group
of students, or attending new teacher seminars together, independently of the overarching grouping
mechanism.

We present a simple extension for the MMSBM to include covariates as

\[
\begin{aligned}
Y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
p_{ij} &= \frac{\exp\{S_{ij}^{T} \operatorname{logit}(B) R_{ji} + \alpha X_{ij}\}}{1 + \exp\{\operatorname{logit}(S_{ij}^{T} B R_{ji}) + \alpha X_{ij}\}} \\
S_{ij} &\sim \mathrm{Multinomial}(1, \theta_i) \\
R_{ji} &\sim \mathrm{Multinomial}(1, \theta_j) \\
\theta_i &\sim \mathrm{Dirichlet}(\lambda) \\
B_{\ell m} &\sim \mathrm{Beta}(a_{\ell m}, b_{\ell m}),
\end{aligned}
\tag{22.6}
\]

FIGURE 22.3

The original sociomatrix (a) versus the probability of a tie as determined by our model using
posterior means (b). In general, estimated tie probabilities mirror the true tie structure. The legend shows
increments of 0.1, with all values except 0 being the upper endpoint of the continuous class of colors
(c).


where $X_{ij}$ is a covariate and $\alpha$ is the coefficient for that covariate.

Model (22.6) can be fit using an MCMC algorithm similar to (22.4), with a more complicated
sampling distribution. We use the same Gibbs update for $\theta$, and the same Metropolis updates for $\gamma$
and $\xi$, as presented in our standard MMSBM (22.4). We use the following Gibbs updates for $S_{ij}$
and $R_{ji}$,


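\[
\begin{aligned}
P(S_{ij} \mid \ldots) &\propto \mathrm{Multinomial}(p), \qquad p_k = \theta_{ik}\, \pi_{k\ell^*}^{Y_{ij}} (1 - \pi_{k\ell^*})^{1 - Y_{ij}} \\
P(R_{ji} \mid \ldots) &\propto \mathrm{Multinomial}(q), \qquad q_k = \theta_{jk}\, \pi_{m^*k}^{Y_{ij}} (1 - \pi_{m^*k})^{1 - Y_{ij}} \\
\pi_{\ell m} &= \frac{\exp\{\operatorname{logit}(B)_{\ell m} + \alpha X_{ij}\}}{1 + \exp\{\operatorname{logit}(B)_{\ell m} + \alpha X_{ij}\}},
\end{aligned}
\tag{22.7}
\]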


where again $\ell^*$ is the group indicated by $R_{ji}$, and $m^*$ is the group indicated by $S_{ij}$. We reparameterize
$B$ and use $\operatorname{logit}(B)$ throughout our MCMC algorithm. The entries in $B$ can no longer be sampled
directly from a closed-form conditional, and we instead use Metropolis-Hastings updates.

To take advantage of random walk updates we reparameterize $B$ as $\operatorname{logit}(B)$, and having an
unbounded support for our proposal distributions allows us to update the diagonal and off-diagonal
elements using the same proposal distribution. Note that $S_{ij}^{T} \operatorname{logit}(B) R_{ji}$ and $\operatorname{logit}(S_{ij}^{T} B R_{ji})$
are equivalent.

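Thus, to update an entry of $\operatorname{logit}(B)$, $\operatorname{logit}(B)_{\ell m}^{(s)}$, we propose a new entry, $\operatorname{logit}(B)_{\ell m}^{(s+1)}$, using a
normal random walk with mean $\operatorname{logit}(B)_{\ell m}^{(s)}$, where the variance is determined by a tuning parameter to
ensure appropriate acceptance rates. The probability of accepting this new entry is $\min\{1, R\}$, where

\[
R = \frac{P(\operatorname{logit}(B)_{\ell m}^{(s+1)} \mid \ldots)}{P(\operatorname{logit}(B)_{\ell m}^{(s)} \mid \ldots)}.
\]

We illustrate this algorithm in Section 22.3.4.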
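The tie probability in the covariate model (22.6), and the equivalence of the two logit expressions, can be checked numerically. The sketch below assumes NumPy; the function names and the toy values of `B`, `alpha`, and the covariate are hypothetical. The equivalence relies on $S_{ij}$ and $R_{ji}$ being indicator (one-hot) vectors, so that both expressions pick out a single entry of $B$.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tie_probability(s, r, B, alpha, x):
    """p_ij in (22.6) for indicator vectors s (sender) and r (receiver)."""
    eta = s @ logit(B) @ r + alpha * x        # block effect plus covariate effect
    return expit(eta)

# Toy values (hypothetical): sender acts in group 0, receiver in group 1.
B = np.array([[0.8, 0.1],
              [0.2, 0.6]])
s = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])

lhs = s @ logit(B) @ r                        # S^T logit(B) R
rhs = logit(s @ B @ r)                        # logit(S^T B R)
p_without = tie_probability(s, r, B, alpha=2.0, x=0)
p_with = tie_probability(s, r, B, alpha=2.0, x=1)
print(lhs, rhs, p_without, p_with)
```

When the covariate is 0 the tie probability reduces to the block entry $B_{\ell m}$; a positive coefficient raises it when the covariate is 1.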


22.3 Modeling an Ensemble of Networks 

22.3.1 The Hierarchical Network Framework 

Consider a collection of $K$ networks $Y = (Y_1, \ldots, Y_K)$ where $Y_k = (Y_{11k}, \ldots, Y_{n_k n_k k})$. The
hierarchical network framework for this collection $Y$ is given as

\[
\begin{aligned}
P(Y \mid X, \Theta) &= \prod_{k=1}^{K} P(Y_k \mid X_k, \Theta_k = (\theta_{1k}, \ldots, \theta_{pk})) \\
(\Theta_1, \ldots, \Theta_K) &\sim F(\Theta_1, \ldots, \Theta_K \mid W_1, \ldots, W_K, \psi),
\end{aligned}
\tag{22.8}
\]

where $P(Y_k \mid X_k, \Theta_k = (\theta_{1k}, \ldots, \theta_{pk}))$ is a probability model for network $k$ with covariates $X_k$.

Notice that this model structure specifies that networks may be independent of each other,
depending on the choice of $W$, but need not be. Additional hierarchical structure can be specified by
including additional parameters $\psi$. Notice also that we purposely omit any within-network
dependence assumptions. Thus, this framework allows for a variety of dependence assumptions both
across and within networks but is also flexible in that any social network model can be used. For
example, Sweet et al. (2013) use this framework to introduce hierarchical latent space models, a
latent space modeling approach for multiple isolated networks.

22.3.2 The Hierarchical Mixed Membership Stochastic Blockmodel 

Let $Y_{ijk}$ be a binary tie from node $i$ to node $j$ in network $k$. The hierarchical mixed membership
stochastic blockmodel is specified as

\[
P(Y \mid S, R, B, \theta, \gamma) = \prod_{k=1}^{K} \Big[ \prod_{i,j} P(Y_{ijk} \mid S_{ijk}, R_{jik}, B_k, \theta_k, \gamma_k)\, P(S_{ijk} \mid \theta_{ik})\, P(R_{jik} \mid \theta_{jk}) \prod_{i} P(\theta_{ik} \mid \lambda_k) \Big],
\tag{22.9}
\]

where $S_{ijk}$ is the group membership indicator vector for person $i$ when sending a tie to person $j$
in network $k$, and $R_{jik}$ is the group membership indicator vector for $j$ when receiving a tie from
$i$ in network $k$; $B_k$ is the network-specific group-group tie probability matrix, and $\theta_{ik}$ is the group
membership probability vector for node $i$ in network $k$.

This is easily presented as a hierarchical Bayesian model: 

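\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(S_{ijk}^{T} B_k R_{jik}) \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\lambda_k) \\
B_{\ell m k} &\sim \mathrm{Beta}(a_{\ell m k}, b_{\ell m k}).
\end{aligned}
\tag{22.10}
\]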

We impose our hierarchical structure by requiring that the parameters come from some common 
distribution, and in fact, this framework becomes particularly interesting in cases where parameters 
are shared across networks. We present several examples in the next section. 

Examples of HMMSBMs 

The hierarchical structure of the HMMSBM naturally lends itself to pooling information across 
networks and we present several extensions of (22.10). 

A simple extension is an HMMSBM for experimental data in which the treatment is hypothesized
to affect a single parameter. The networks in the treatment condition would be generated from
the same model and the control condition networks would be generated from a different model. For
example, suppose we examine teacher collaboration networks in high schools. Typically we would
expect to see teachers collaborating within their own departments and these departments operating
mostly in isolation. But we could imagine an intervention whose aim is to increase collaboration
across departments. Teachers in treatment schools would then be more likely to have across-department
ties than teachers in control schools. Such a model is given as


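\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(S_{ijk}^{T} B_k R_{jik}) \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\lambda_k), \quad \text{where } \lambda_k = \lambda_0 + T_k (\mathbf{1} - \lambda_0)(1 - \alpha) \\
B_{\ell m k} &\sim \mathrm{Beta}(a_{\ell m k}, b_{\ell m k}) \\
\alpha &\sim \mathrm{Uniform}(0, 1),
\end{aligned}
\tag{22.11}
\]

where $T_k$ is the indicator for being in the treatment group, $\mathbf{1}$ is the vector $(1, \ldots, 1)$ with length $g$,
and $g$ is the number of groups. The treatment effect $\alpha$ is a proportion describing how similar the group
membership profiles are to the control group as compared to a uniform distribution on the simplex.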
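The role of the treatment effect in (22.11) is easiest to see by evaluating the Dirichlet parameter at a few values. The sketch below assumes NumPy; the function name and the control value 0.2 are hypothetical.

```python
import numpy as np

def dirichlet_parameter(lam0, treated, alpha):
    """lambda_k = lambda_0 + T_k (1 - lambda_0)(1 - alpha), as in (22.11)."""
    lam0 = np.asarray(lam0, dtype=float)
    return lam0 + treated * (1.0 - lam0) * (1.0 - alpha)

lam0 = np.full(3, 0.2)                         # hypothetical control-school parameter
control = dirichlet_parameter(lam0, treated=0, alpha=0.5)
halfway = dirichlet_parameter(lam0, treated=1, alpha=0.5)
uniform = dirichlet_parameter(lam0, treated=1, alpha=0.0)
same_as_control = dirichlet_parameter(lam0, treated=1, alpha=1.0)
print(control, halfway, uniform, same_as_control)
```

For a treated network the parameter is the convex combination $\alpha\lambda_0 + (1 - \alpha)\mathbf{1}$: at $\alpha = 1$ it equals the control value, and at $\alpha = 0$ it is the vector of ones, which corresponds to a uniform distribution on the simplex.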

Rather than constraining each network to have a constant network-level parameter, e.g., $\lambda_0$,
we might instead model network parameters generated from a single distribution, introducing an
additional level to the hierarchy. Suppose we are interested in how variable the membership
probability vectors are across networks; for example, we expect teacher collaboration networks to vary
depending on the organizational structure in the schools. Then we could estimate the distributional
hyperparameters that generate these membership probabilities ($\theta$).

An example of such a model is 

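\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(S_{ijk}^{T} B_k R_{jik}) \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi_k) \\
B_{\ell m k} &\sim \mathrm{Beta}(a_{\ell m k}, b_{\ell m k}) \\
\gamma_k &\sim \mathrm{Gamma}(\tau, \beta) \\
\xi_k &\sim \mathrm{Dirichlet}(c) \\
\tau &\sim \mathrm{Gamma}(a_\tau, b_\tau) \\
\beta &\sim \mathrm{Gamma}(a_\beta, b_\beta).
\end{aligned}
\tag{22.12}
\]

Thus we allow $\gamma_k$ to vary by network and then estimate an overall mean and variance as determined
by $(\tau, \beta)$.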

Finally, we introduced a covariate MMSBM in Section 22.2.3, which we can easily extend to
multiple networks. Consider again our networks of teacher collaboration. A tie-level variable
indicating whether two teachers serve on the same committee may be a covariate of interest, e.g.,
$X_{ijk} = 1$ if teachers $i$ and $j$ in school $k$ serve on the same committee, and we may want to estimate
this effect across all networks. A simple model in which the covariate effect is the same across
networks is given as

\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(p_{ijk}) \\
p_{ijk} &= \frac{\exp\{S_{ijk}^{T} \operatorname{logit}(B_k) R_{jik} + \alpha X_{ijk}\}}{1 + \exp\{\operatorname{logit}(S_{ijk}^{T} B_k R_{jik}) + \alpha X_{ijk}\}} \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi_k) \\
B_{\ell m k} &\sim \mathrm{Beta}(a_{\ell m k}, b_{\ell m k}).
\end{aligned}
\tag{22.13}
\]

These are merely a few models from the myriad of possibilities. Network-level experiments can 
affect other parameters in the model; indeed we can include additional hierarchical structure when 
modeling experimental data. Moreover, observational data may not need the full structure specified 
above and covariates can be incorporated in other ways as well. 

22.3.3 Model Estimation 

We use an MCMC algorithm for fitting HMMSBMs that is similar to the one used for fitting the single
network MMSBM. We first present MCMC steps for fitting the model given in (22.10), and then we
discuss how these steps need to be augmented for models (22.11)-(22.13).

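For each network $k$, we use Gibbs updates for $\theta_k$, $S_k$, $R_k$, and $B_k$. The complete conditionals for
our Gibbs updates are given as:

\[
\begin{aligned}
P(\theta_{ik} \mid \ldots) &\propto \mathrm{Dirichlet}\Big(\gamma_k \xi_k + \sum_j S_{ijk} + \sum_j R_{ijk}\Big) \\
P(S_{ijk} \mid \ldots) &\propto \mathrm{Multinomial}(p), \qquad p_h = \theta_{ikh}\, B_{h \ell_k^* k}^{Y_{ijk}} (1 - B_{h \ell_k^* k})^{1 - Y_{ijk}} \\
P(R_{jik} \mid \ldots) &\propto \mathrm{Multinomial}(q), \qquad q_h = \theta_{jkh}\, B_{m_k^* h k}^{Y_{ijk}} (1 - B_{m_k^* h k})^{1 - Y_{ijk}} \\
P(B_{\ell m k} \mid \ldots) &\propto \mathrm{Beta}\Big(a_{\ell m k} + \sum_{(ijk)^*} Y_{ijk},\; b_{\ell m k} + \sum_{(ijk)^*} (1 - Y_{ijk})\Big),
\end{aligned}
\tag{22.14}
\]

where $\ell_k^*$ is the group membership indicated by $R_{jik}$ and $m_k^*$ is the group membership indicated
by $S_{ijk}$. Again, let $(ijk)^*$ be a specific subset of $i = 1, \ldots, n_k$ and $j = 1, \ldots, n_k$ such that $S_{ijk} = \ell$
and $R_{jik} = m$. Again, we incorporate a sparsity parameter $\rho$ to account for the absence of ties due
to lack of interaction.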

For the intervention and covariate examples (22.11) and (22.13), respectively, the additional
parameter $\alpha$ uses Metropolis or Metropolis-Hastings updates. For example, if we use a random
walk method for proposing new values of $\alpha$, we accept $\alpha^{(s+1)}$ with probability $\min\{1, R\}$, where

\[
R = \frac{P(\alpha^{(s+1)} \mid \ldots)}{P(\alpha^{(s)} \mid \ldots)}.
\]

Note that $R$ is a function of all $K$ networks.

For models with additional levels of hierarchy we can update additional parameters using
Metropolis within Gibbs steps. For example, in (22.12), $\beta$ is updated using Gibbs steps,

\[
P(\beta \mid \ldots) \propto \mathrm{Gamma}\Big(K\tau + a_\beta,\; \sum_{k=1}^{K} \gamma_k + b_\beta\Big),
\]

and we use Metropolis updates for $\gamma_k$, $\xi_k$, and $\tau$. We update each $\gamma_k$ using a Metropolis
step analogous to that for the single network. We use a network-specific tuning parameter $\nu_{\gamma_k}$, which is also the
shape parameter, and a rate parameter of $\nu_{\gamma_k} / \gamma_k^{(s)}$, ensuring the proposal distribution has mean at the
current value of $\gamma_k$. Each $\xi_k$ is updated in the same way using a Dirichlet proposal distribution.
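The closed-form update for $\beta$ follows from gamma-gamma conjugacy: with $\gamma_k \sim \mathrm{Gamma}(\tau, \beta)$ in the shape-rate parameterization and a gamma prior on the rate $\beta$, the full conditional is again gamma. A minimal sketch, assuming NumPy and using hypothetical function names and toy values:

```python
import numpy as np

def beta_conditional_params(gammas, tau, a_beta, b_beta):
    """Shape and rate of the full conditional of beta in (22.12)."""
    gammas = np.asarray(gammas, dtype=float)
    shape = len(gammas) * tau + a_beta         # K * tau + a_beta
    rate = gammas.sum() + b_beta               # sum_k gamma_k + b_beta
    return shape, rate

def draw_beta(gammas, tau, a_beta, b_beta, rng):
    shape, rate = beta_conditional_params(gammas, tau, a_beta, b_beta)
    return rng.gamma(shape, 1.0 / rate)        # NumPy uses shape and scale = 1 / rate

# Toy values (hypothetical): K = 20 networks with gamma_k = 0.2 each.
rng = np.random.default_rng(0)
gammas = np.full(20, 0.2)
shape, rate = beta_conditional_params(gammas, tau=10.0, a_beta=50.0, b_beta=1.0)
draws = np.array([draw_beta(gammas, 10.0, 50.0, 1.0, rng) for _ in range(20000)])
print(shape, rate, draws.mean())
```

The one practical pitfall is the parameterization: NumPy's gamma sampler takes a scale, so the rate from the full conditional must be inverted before sampling.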

To update $\tau$ we use a gamma proposal distribution not unlike those used for $\gamma_k$, with shape
parameter as the tuning parameter $\nu_\tau$ and rate parameter such that the proposal distribution has mean
of the current value of $\tau$. Then the proposed value $\tau^{(s+1)}$ is accepted with probability $\min\{1, R\}$,
where

\[
R = \frac{P(\tau^{(s+1)} \mid \ldots)\, P(\tau^{(s)} \mid \tau^{(s+1)})}{P(\tau^{(s)} \mid \ldots)\, P(\tau^{(s+1)} \mid \tau^{(s)})}.
\]


22.3.4 Empirical Examples 

We present two examples to illustrate fitting HMMSBMs and use two simulated datasets, with and 
without a covariate. 

In the first example we demonstrate fitting an HMMSBM similar to the example given in (22.12)
where each network has a network-specific Dirichlet hyperparameter, $\gamma_k$, used to generate the
membership probability vectors. Our goal is to assess parameter recovery on three levels: the
hyperparameters of the distribution that generates $\gamma_k$, the $\gamma_k$ themselves, and the lower-level parameters,
$R$, $S$, and $B$, that determine the probability of a tie.

We simulate data from 20 networks, each with 20 nodes and 4 groups, using the following model
to generate our first set of data:

\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(S_{ijk}^{T} B_k R_{jik}) \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi) \\
\gamma_k &\sim \mathrm{Gamma}(10, 50),
\end{aligned}
\]

where $\xi = (0.25, 0.25, 0.25, 0.25)$. The group-group tie probability matrix is defined as

\[
B = \begin{pmatrix}
0.9 & 0.05 & 0.05 & 0.05 \\
0.05 & 0.8 & 0.05 & 0.05 \\
0.05 & 0.05 & 0.7 & 0.05 \\
0.05 & 0.05 & 0.05 & 0.6
\end{pmatrix}.
\tag{22.15}
\]

We constrain $B$ to be the same for each network and select the hyperparameters with which to
generate $\gamma_k$ to ensure small enough values for block structure with low variability. Figure 22.4
shows adjacency matrices for these 20 networks.

We fit the following HMMSBM to these data using the MCMC algorithm described in
Section 22.3.3. We let $\xi = (0.25, 0.25, 0.25, 0.25)$ and use the same sparsity parameter for
all $K$ networks. The model is given as

\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(S_{ijk}^{T} B_k R_{jik}) \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi) \\
B_{\ell\ell k} &\sim \mathrm{Beta}(3, 1) \\
B_{\ell m k} &\sim \mathrm{Beta}(1, 10), \quad \ell \neq m \\
\gamma_k &\sim \mathrm{Gamma}(\tau, \beta) \\
\tau &\sim \mathrm{Gamma}(10, 1) \\
\beta &\sim \mathrm{Gamma}(50, 1).
\end{aligned}
\tag{22.16}
\]

FIGURE 22.4

Networks with 20 nodes generated from an HMMSBM with group membership probabilities from a
network-specific Dirichlet parameter $\gamma_k \sim \mathrm{Gamma}(10, 50)$.

FIGURE 22.5

Posterior density for $\tau$ and $\beta$, the hyperparameters for the distribution of $\gamma_k$ for each network. The
vertical lines mark the value used to simulate the data, and the 95% equal-tailed credible intervals
are indicated with gray. Densities show good recovery of the true value of each parameter.

We fit the model using our MCMC algorithm and run chains of length 30,000. We remove the
first 5000 steps and retain every 25th iteration for a posterior sample of size 1001. The posterior
samples for $\tau$ and $\beta$ are illustrated as densities in Figure 22.5. The vertical lines show the true value
for each parameter and the gray region indicates the 95% credible interval, suggesting accurate
parameter recovery for $\tau$ and $\beta$. Similar plots for $\gamma_k$ (see Figure 22.6) for each network $k$ depict the
variability in both the true value of $\gamma_k$ as well as the accuracy of recovery.

FIGURE 22.6

Posterior densities for $\gamma_k$ where $k = 1, \ldots, 20$. The 95% equal-tailed credible intervals contain the
true value of $\gamma_k$ for all but one of the simulated networks, suggesting good recovery.

FIGURE 22.7

Tie probability matrix as estimated by posterior means. Visual comparisons to Figure 22.4 suggest
accurate estimation of tie probability.

We use predicted probability tie matrices to assess recovery of lower-level parameters
(Figure 22.7). Ties with high probability are shown as shades of red, orange, and yellow, and ties with
low probability are shown as shades of blue and purple. Visual comparisons to Figure 22.4 reveal
that the estimated tie probabilities align with the simulated data; observed pairwise ties are reflected
as having higher probability in the fitted model than non-ties. We do note, however, that ties that
exist across groups (those shown outside of the block structure) tend to have smaller estimated
probabilities than ties within groups.

The second simulation serves two purposes: to illustrate fitting an MMSBM with covariates and
to provide a second example of fitting HMMSBMs. We generate data for 10 networks, each with
15 nodes and 3 groups. We use a single edge-level indicator covariate $X_{ijk}$, such that $X_{ijk} = 1$
implies that individual $i$ in network $k$ and individual $j$ in network $k$ have the same characteristic
and $X_{ijk} = 0$ otherwise. In the context of teacher relationships in school $k$, for example, $X_{ijk}$ might
represent teaching the same grade, serving on the same committee, having classrooms in the same
wing of the building, etc. For these data, we randomly assigned each node to one of 5 groups, and
$X_{ijk} = 1$ if nodes belong to the same group. The formal model used to generate these data is:

\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(p_{ijk}) \\
p_{ijk} &= \frac{\exp\{S_{ijk}^{T} \operatorname{logit}(B_k) R_{jik} + 4 X_{ijk}\}}{1 + \exp\{\operatorname{logit}(S_{ijk}^{T} B_k R_{jik}) + 4 X_{ijk}\}} \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi) \\
B_{\ell\ell k} &\sim \mathrm{Beta}(12, 4) \\
B_{\ell m k} &\sim \mathrm{Beta}(3, 30), \quad \ell \neq m \\
\gamma_k &\sim \mathrm{Gamma}(10, 60),
\end{aligned}
\tag{22.17}
\]

where $\xi = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$. Priors for $\gamma_k$ were selected to ensure small enough values for block structure
with low variability. We use different priors for the diagonal entries of $B_k$ than the off-diagonal
entries to model higher within-group tie probabilities. Hyperparameters of these priors were chosen
to yield high and low probabilities for the diagonal and off-diagonal entries, respectively, without
extreme values of almost 0 or 1.

The adjacency matrices for each of the 10 simulated networks are shown in Figure 22.8. We 
expect to see more variability in the block structure in these networks as compared to the first 
simulation study for two reasons. Foremost, we have included a covariate with a strong effect so 
that there are now many more across-group ties. In addition, we have varied the group-group tie 
probability matrix B k by allowing these entries to both differ across networks and be generated 
(instead of deliberately chosen). As a result, the block structure that we do see varies across networks 
as the values of the diagonal entries of B k vary. 

We fit the following model on these simulated data:

\[
\begin{aligned}
Y_{ijk} &\sim \mathrm{Bernoulli}(p_{ijk}) \\
p_{ijk} &= \frac{\exp\{S_{ijk}^{T} \operatorname{logit}(B_k) R_{jik} + \alpha X_{ijk}\}}{1 + \exp\{\operatorname{logit}(S_{ijk}^{T} B_k R_{jik}) + \alpha X_{ijk}\}} \\
S_{ijk} &\sim \mathrm{Multinomial}(\theta_{ik}, 1) \\
R_{jik} &\sim \mathrm{Multinomial}(\theta_{jk}, 1) \\
\theta_{ik} &\sim \mathrm{Dirichlet}(\gamma_k \xi) \\
B_{\ell\ell k} &\sim \mathrm{Beta}(12, 4) \\
B_{\ell m k} &\sim \mathrm{Beta}(3, 30), \quad \ell \neq m \\
\gamma_k &\sim \mathrm{Gamma}(\tau, \beta) \\
\tau &\sim \mathrm{Gamma}(1, 0.1) \\
\beta &\sim \mathrm{Gamma}(6, 0.1) \\
\alpha &\sim \mathrm{Normal}(0, 100),
\end{aligned}
\tag{22.18}
\]

where $\xi = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$.

FIGURE 22.8

Networks generated from a single covariate HMMSBM with group membership probabilities from
a network-specific Dirichlet parameter $\gamma_k \sim \mathrm{Gamma}(10, 60)$. Despite a strong block structure
specified by small values of $\gamma_k$, the networks have many across-group ties due to the high value of the
regression coefficient of $X_{ijk}$.

We run MCMC chains of length 30,000, remove the first 5000 iterations, and keep every 25th
step. With a posterior sample size of 1001, we assess parameter recovery. We begin with our
high-level parameters. Figure 22.9 and Figure 22.10 show the posterior densities for $\alpha$ and for $\tau$ and $\beta$,
respectively. The true value of each parameter is indicated by a vertical line and 95% credible
interval regions are shown in gray.

The estimation of $\alpha$ is accurate, but the estimates for $\tau$ and $\beta$ are much less precise. The posterior
distribution for $\tau$ is centered at a higher value than the value of $\tau = 10$ used to generate the data.
The distribution for $\beta$ is centered at a value slightly lower than the true value $\beta = 60$. Similarly, the
distributions for each $\gamma_k$ are skewed toward higher values. As shown in Figure 22.11, only 3 of the
10 posterior samples for $\gamma_k$ contain the true value in their 95% credible interval. We suspect the lack
of block structure contributes to these biases even though the covariate was the primary influence
for across-group ties in the data generation process.

FIGURE 22.9

Posterior density for $\alpha$, the regression coefficient in Equation (22.13). The true value of $\alpha$ is 4 and
is displayed as the vertical line. The 95% equal-tailed credible interval is implied by the gray region
and suggests good recovery.

FIGURE 22.10

Posterior density for $\tau$ (a) and $\beta$ (b) with the true values shown as vertical lines. The 95%
equal-tailed credible intervals are implied by the gray region. Much of the posterior distribution for $\tau$ falls
to the right of the true value used to generate the data.

FIGURE 22.11

Posterior densities for $\gamma_k$ where $k = 1, \ldots, 10$. The 95% equal-tailed credible intervals contain the
true value of $\gamma_k$ for only one of the networks and tend to overestimate the value of $\gamma_k$.

FIGURE 22.12

Tie probability matrix as estimated by posterior means. Ties with high probability are shown as
shades of red, orange, and yellow and ties with low probability are shown as shades of blue and
purple. Pairwise probabilities align well with the original adjacency matrices (Figure 22.8).


We plot the pairwise probability of a tie in each network in Figure 22.12. Due to the lack of 
block structure, visually comparing the predicted tie probabilities to the original dataset may seem 
inconclusive, but in fact predicted probabilities align well with the data. 

22.3.5 HMMSBM Extension: Sensitivity Analysis 

Given the small number of networks used in our simulations, we are interested in the extent to
which our prior specification dominates our model fit. Recall from (22.15) that we generated data with
$\tau = 10$ and $\beta = 50$, and in the model fit illustrated in Section 22.3.4, we used the following prior
distributions:

\[
\tau \sim \mathrm{Gamma}(10, 1), \qquad \beta \sim \mathrm{Gamma}(50, 1).
\]

We repeat model estimation twice, using a less strong (moderate) prior and a weak prior, such that both are
centered at the true values. The moderate and weak priors used are given, respectively, as

\[
\tau \sim \mathrm{Gamma}(1, 0.1), \qquad \beta \sim \mathrm{Gamma}(5, 0.1),
\]

\[
\tau \sim \mathrm{Gamma}(0.1, 0.01), \qquad \beta \sim \mathrm{Gamma}(0.5, 0.01).
\]

We first compare the posterior distributions for $\tau$ and $\beta$. While the posterior distributions for
$\tau$ and $\beta$ contain the true values for each fit, the variance of the posterior sample increases as the
variance of the prior distribution increases (Figure 22.13). We do note that the scale of the increase is
less than the 10-fold increase of the prior distribution variance (50, 500, and 5000 for $\beta$ and 10, 100,
and 1000 for $\tau$). The posterior mean for $\tau$ varies little as the prior changes: 8.5, 8.2, and 10.5 under
a strong, less strong, and weak prior, respectively. The posterior mean for $\beta$ is much less accurate
when the weak prior is used. The respective means are 51.1, 51.8, and 120.3.
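The prior means and variances behind this comparison follow directly from the shape-rate parameterization of the gamma distribution (mean = shape / rate, variance = shape / rate squared). The sketch below tabulates them for the three prior specifications; the function and variable names are hypothetical.

```python
import math

def gamma_moments(shape, rate):
    """Mean and variance of a Gamma(shape, rate) distribution."""
    return shape / rate, shape / rate**2

priors = {
    "tau":  [(10, 1), (1, 0.1), (0.1, 0.01)],     # strong, less strong, weak
    "beta": [(50, 1), (5, 0.1), (0.5, 0.01)],
}
moments = {name: [gamma_moments(a, b) for a, b in specs]
           for name, specs in priors.items()}
for name, values in moments.items():
    print(name, [(round(m, 6), round(v, 6)) for m, v in values])
```

Each step from strong to less strong to weak keeps the prior mean fixed (10 for $\tau$, 50 for $\beta$) while multiplying the prior variance by 10.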

To assess the prior distributions, we compare the 95% credible intervals of the posterior distributions for
$\gamma_k$, $k = 1, \ldots, 20$, with the true values (Figure 22.14). We notice the following patterns: if a 95%
credible interval for $\gamma_k$ does not cover the true value when the prior distribution is strong, it fails to
cover the true value when the prior is moderate or weak. There is little difference in parameter
recovery between the strong prior and the less strong prior. The weak prior fit recovers few of the
$\gamma_k$ well, and is strongly biased toward smaller values of $\gamma_k$.

Finally, we are interested in how these differences translate to tie probabilities. Figure 22.15 
shows the adjacency matrix and posterior mean of the pairwise probability of a tie determined for 
each model fit. We also include a measure of variability, the width of the 95% credible interval for 
each pairwise tie probability. For brevity, we only show the first five networks from the data. The 
posterior pairwise tie probability varies little across each fit and what is even more surprising is that 
the 95% credible interval widths also vary little. 

Based on this simple sensitivity analysis, we offer several conclusions. The prior specification of
the high-level parameters $\tau$ and $\beta$ has a moderate influence on the mid-level parameters $\gamma_k$ and very little
influence on the low-level parameters $R$, $S$, and $B$, even with poor recovery of mid-level parameters.
Furthermore, using a prior with much larger variance does not necessarily increase the variability in
the low-level parameter estimates.

FIGURE 22.13

A comparison of posterior distributions for $\tau$ and $\beta$ given three different prior gamma
distributions. Hyperparameters are (10, 1), (1, 0.1), (0.1, 0.01) and (50, 1), (5, 0.1), (0.5, 0.01) for $\tau$ and $\beta$,
respectively, and plots are shown top to bottom.

FIGURE 22.14

For each choice of prior specification, strong (a), less strong (b), and weak (c), 95% credible
intervals for $\gamma_k$ are shown in black and the true value of $\gamma_k$ is shown in green. Parameter recovery is
good when the strong or less strong priors are used but is poor when the weak prior is used.

FIGURE 22.15

The adjacency matrix (a) can be compared to the posterior tie probabilities for three model fits that
vary by prior distribution specification for $\tau$ and $\beta$. Priors for each parameter have the same mean
but increase in variance by a factor of 10. The width of the 95% credible interval serves as a measure
of variability.

22.4 Discussion

We have presented the hierarchical network models (HNM) framework for modeling ensembles of 
networks and introduced the hierarchical mixed membership stochastic blockmodel (HMMSBM) 
as an example of an HNM for networks with subgroup structure. This fills a substantial method- 
ological void: while both single and mixed membership stochastic blockmodels have been used to 
incorporate grouping structure into models for relational data, very little prior work has focused on 
jointly modeling an ensemble of networks, and none of that work has focused on blockmodels. In 
addition we have presented a method for incorporating tie covariates into these models, addressing 
another void in the literature. 

We presented several examples of HMMSBMs to demonstrate both the generality and wide 
utility of these models. We used two simulated datasets, one with a covariate and one without a 
covariate, to illustrate model fitting using our MCMC algorithm. Posterior tie probabilities from our 
fits align well with simulated true ties and non-ties, and in most cases parameters were recovered 
well. High-level parameters, those furthest away from the data, were recovered with less consis- 
tency in the simulation study involving tie covariates. Finally, we investigated the effects of prior 
specification and found that, as expected, high-level parameters were most affected by choice of 
prior but that priors had little influence on predicted tie probabilities. 

With respect to the class of HMMSBMs and model fitting, our work reveals several areas for 
future work. Ties perhaps can form independently of subgroup structure due to common attributes; 
including covariates to account for this should produce preferable models. An important area for 
future research is understanding how the covariate effects and block effects interact with each other. 
Finally, high-level parameter estimates seem to depend strongly on hyperpriors, suggesting that 
estimation of these parameters is not yet data-dominated. Understanding how this situation improves 
as more networks (and perhaps larger networks) are added to the ensemble is also clearly important. 
On the other hand, it appears that priors have little effect on the low-level tie probabilities. 

We have illustrated a proof of concept for HMMSBMs and the HNM framework in general. 
HMMSBMs are appropriate models for ensembles of networks with block structure and can be fit 
using relatively simple methods. The HNM framework is larger than HMMSBMs alone since most 
single network statistical models can be extended to model an ensemble of networks. Sweet et al. 
(2013) introduced hierarchical latent space models as a class of HNM models, and the authors are 
currently working on extending work done by Zijlstra, van Duijn and Snijders (2006) and Templin,
Ho, Anderson and Wasserman (2003) for hierarchical exponential random graph models, and
relating it to the general HNM framework.

Time-evolving networks are a natural representation for dynamic social and biological interactions.
While latent space models are gaining popularity in network modeling and analysis, previous works 
mostly ignore networks with temporal behavior and multi-modal actor roles. Furthermore, prior 
knowledge, such as division and grouping of social actors or biological specificity of molecular 
functions, has not been systematically exploited in network modeling. In this chapter, we develop 
a network model featuring a state-space mixture prior that tracks complex actor latent role changes 
through time. We provide a fast variational inference algorithm for learning our model, and validate 
it with simulations and held-out likelihood comparisons on real-world, time-evolving networks. 
Finally, we demonstrate our model’s utility as a network analysis tool, by applying it to United 
States Congress voting data. 




23.1 Introduction 

Social and biological systems can often be represented as a series of temporal networks over actors, 
and these networks may undergo systematic rewiring or experience large topological changes over 
time. The dynamics of these time-evolving networks pose many interesting questions. For instance, 
what are the latent roles played by these networked actors? How will these roles dictate the way two 
actors interact? Furthermore, how do actors play multiple roles (multi-functionality) in different 
social and biological contexts, and how does an actor’s set of roles evolve over time? By knowing 
which actors play what roles as well as the relationships between different roles, we can gain insight 
as to how social or biological communities form in networks. For example, we can elucidate how 
actors with diverse role compositions group together, and how these groupings change over time. 

In particular, we want network actors to be capable of multiple roles, because assuming a single 
role per actor may simply be too restrictive. As an example, consider a social network composed 
of working adults. We can imagine the participants play at least two roles: one when at work (say, 
being a manager or a worker), and one when at home (perhaps a parent, or possibly unmarried). 
These two classes of roles are orthogonal to each other, thus one cannot account for all network 
behaviors with just one class. 

The time-evolving aspect of the network is equally important — we do not expect each actor’s 
roles to remain static over time, but anticipate that they will change, giving rise to rewiring in the 
network. Returning to the previous example, we might imagine a newly-pregnant mother increasing 
her “parent” role, or a promoted employee shifting from worker to manager. In fact, multiple roles 
could change at once — a working father caught in an accident would be less active both as a worker 
and as a parent, for instance. 

A final, crucial assumption is that the relationships between roles remain constant over time, 
like how a manager always delegates work to a subordinate, or how parents are always involved 
in raising children. This static relationship between roles provides a reference point for actor role 
mixtures to evolve over time; it is difficult to interpret actor role changes if the roles themselves are 
also changing! In fact, allowing both actor roles and role relationships to change arbitrarily makes 
for an ill-posed problem; it becomes unclear if a given network change should be explained in terms 
of actor roles or role relationships, or even a combination of both. 

In this chapter, we present a mixed membership solution to understanding time-evolving networks,
which we call a dynamic mixture of mixed membership stochastic blockmodels (dM³SB).
This model employs the regular mixed membership stochastic blockmodel (MMSB) as the basic
building block, but augments it with a multi-modal mixture prior that captures each actor’s
role-mixture trajectory in a statistically flexible manner. Essentially, we conjoin the MMSB with a set of
state-space models, one over each mixture component, and each state-space trajectory corresponds
to the average evolution of the role mixtures of a group of actors.

Compared to MMSB, this evolving mixture prior presents additional challenges to parameter
learning and latent variable inference. We overcome these difficulties by developing a variational
EM algorithm inspired by ideas from Ghahramani and Hinton (2000) and dMMSB (an earlier version
of dM³SB) (Xing et al., 2010), which allow for efficient approximate inference and parameter
learning. In the following sections, we first develop the dM³SB model and variational EM algorithm,
after which we present validation experiments on both synthetic and real data. Finally, we conclude
by applying dM³SB to voting data from the United States Congress.


23.2 Related Work

There is increasing interest in employing latent space models for network analysis¹ (Hoff et al.,
2002; Handcock et al., 2007; Heaukulani and Ghahramani, 2013; Soufiani and Airoldi, 2012), of
which dM³SB is one kind. However, most of these models assume static networks and a single, fixed
role for each actor. Hence, they cannot model evolution of multiple actor roles over time, making
them unsuitable for analyzing complex temporal networks.

With respect to addressing these issues, Airoldi et al. (2008) provided a foundation with MMSB,
which permits actors to have role mixtures instead of single roles. Later, Xing et al. (2010) developed
a dynamic extension of MMSB, called dMMSB, which addresses temporal evolution of actor role
mixtures. The dMMSB places a time-evolving, unimodal prior on all network actors; specifically, it
employs a time-evolving logistic normal distribution similar to a state-space model.

Although an important first step towards dynamic network analysis, dMMSB offers limited
modeling power: because it employs a unimodal logistic normal for the role distribution of all actors,
it is only applicable to networks where the role mixtures of all actors follow similar, unimodal
dynamics. A direct solution might be to introduce a separate dynamic process for each actor, but
not only is this computationally impractical for large networks with many actors, it is also statistically
unsatisfactory from a Bayesian standpoint, as the actors no longer share any common pattern
and coupling, leaving the model prone to over-fitting and unable to support activity and anomaly
detection.

This challenge naturally leads us to explore “evolving clusters” of actors. By modeling dynamic
processes on clusters, rather than on individuals or on the whole network, we can increase inferential
power while retaining a common, yet expressive, multi-modal mixture model prior over each actor.
Such a prior allows dM³SB to accommodate the non-stationary and heterogeneous behaviors of
actors.


23.3 Problem Formulation 

We consider a sequence of interaction networks or graphs, denoted by $\{\mathcal{G}^{(t)}\}_{t=1}^{T}$,
where each $\mathcal{G}^{(t)} = \{V, \mathcal{E}^{(t)}\}$ represents the network observed at time $t$.
We assume the set of actors $V = \{1, \ldots, N\}$ is constant. Furthermore, we permit the set of
interactions between actors, $\mathcal{E}^{(t)} = \{e_{ij}^{(t)}\}_{i,j=1}^{N}$, to evolve with time.
We ignore self edges $e_{ii}^{(t)}$.

Our goal is to infer the time-evolving actor role mixtures that give rise to this network sequence. 
An actor’s role mixture is essentially a probability distribution over network roles. For example, a 
person in a social network could be 0.5 manager and 0.5 parent, meaning that half of his interactions 
(and non-interactions) can be explained in terms of manager role behavior, while the other half can 
be explained in terms of parenting behavior. The precise definition of an actor role mixture will be 
made clear later. 

We approach this problem by extending the mixed membership stochastic blockmodel (MMSB)
(Airoldi et al., 2008), a static network model that treats each actor as having a mixture of network
roles. The key modification is the addition of a time-evolving (i.e., dynamic) prior on top of
the MMSB, which allows it to account for temporally-evolving network dynamics. This prior is
a mixture of time-evolving logistic normal distributions, which is multi-modal, time-evolving, and
captures correlations between roles. In particular, it is similar to the factorial hidden Markov model,
for which variational inference techniques have been developed (Ghahramani and Hinton, 2000).
With this prior, the resulting MMSB model is able to fit complex, time-evolving data densities that
the static, unimodal, uncorrelated Dirichlet prior used in MMSB cannot.

¹ Also see the chapter entitled “Mixed Membership Blockmodels for Dynamic Networks with Feedback”
(Cho et al., 2014).


23.4 Time-Evolving Network Models 

Rather than directly introduce the full dM³SB model, we shall start by introducing the regular
MMSB, and gradually extend it to become dM³SB. We hope that this presentation will not only be
easier to understand, but will also make the connection between MMSB and dM³SB clearer.

23.4.1 The Mixed Membership Stochastic Blockmodel (MMSB) 

We begin by describing the mixed membership stochastic blockmodel (Airoldi et al., 2008), which
serves as the foundation for our model. The MMSB is a static network model, meaning that we only
consider one network $\mathcal{E} = \{e_{ij}\}_{i,j=1}^{N}$. Furthermore, it assumes each actor
$v_i \in V$ possesses a latent mixture of $K$ roles, which determine observed network interactions.
This role mixture formalizes the notion of actor multi-functionality, and we denote it by a normalized
$K \times 1$ vector $\pi_i$, referred to as a mixed membership or MM vector. We assume these vectors
are drawn from some prior $p(\pi)$.

Given MM vectors $\pi_i$, $\pi_j$ for actors $i$ and $j$, the network edge $e_{ij}$ is stochastically
generated as follows: first, actor $i$ (the donor) picks one role $z_{\to ij} \sim p(z \mid \pi_i)$ to
interact with actor $j$. Next, actor $j$ (the receiver) also picks one role
$z_{\leftarrow ij} \sim p(z \mid \pi_j)$ to receive the interaction from $i$. Both $z_{\to ij}$,
$z_{\leftarrow ij}$ are $K \times 1$ unit indicator vectors. Finally, the chosen roles of $i$, $j$
determine the network interaction $e_{ij} \sim p(e \mid z_{\to ij}, z_{\leftarrow ij})$, where
$e_{ij} \in \{0, 1\}$. The specific distributions over $z_{\to ij}$, $z_{\leftarrow ij}$, $e_{ij}$ are:

• $z_{\to ij} \sim \mathrm{Multinomial}(\pi_i)$. Actor $i$’s donor role indicator.

• $z_{\leftarrow ij} \sim \mathrm{Multinomial}(\pi_j)$. Actor $j$’s receiver role indicator.

• $e_{ij} \sim \mathrm{Bernoulli}(z_{\to ij}^{\top} B z_{\leftarrow ij})$. Interaction outcome from actor $i$ to $j$,

where $B$ is a $K \times K$ role compatibility matrix. Intuitively, the bilinear form
$z_{\to ij}^{\top} B z_{\leftarrow ij}$ selects a single element of $B$; the indicators $z_{\to ij}$,
$z_{\leftarrow ij}$ behave like indices into $B$.
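
The generative steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`sample_role`, `sample_edge`, `sample_network`) and the seed argument are hypothetical, and MM vectors are passed as plain lists of probabilities.

```python
import random

def sample_role(pi, rng):
    """Draw a role index from the categorical distribution pi."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]

def sample_edge(pi_i, pi_j, B, rng):
    """Sample e_ij for one ordered pair (i, j) under the MMSB."""
    k = sample_role(pi_i, rng)   # donor role picked by actor i
    l = sample_role(pi_j, rng)   # receiver role picked by actor j
    # The indicator vectors act as indices into B, so the bilinear
    # form z_{->ij}^T B z_{<-ij} is simply B[k][l].
    return 1 if rng.random() < B[k][l] else 0

def sample_network(pis, B, seed=0):
    """Sample a directed adjacency matrix; self edges are left at 0."""
    rng = random.Random(seed)
    n = len(pis)
    return [[0 if i == j else sample_edge(pis[i], pis[j], B, rng)
             for j in range(n)] for i in range(n)]
```

Because each ordered pair draws fresh role indicators, the same actor can act in different roles towards different partners.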

This generative model has two noteworthy features. First, observed relations $\mathcal{E}$ result
from actor latent roles interacting. In the case of social networks, the latent roles are naturally
interpretable as social functions, like manager, worker, parent, or single adult. Note that actor $i$’s
latent membership indicators $z_{\to i\cdot}$, $z_{\leftarrow \cdot i}$ are unique to each interaction;
he/she may assume different roles for interacting with each actor.

Second, the role compatibility matrix $B$ completely determines the affinity between latent roles.
For example, a diagonally-dominant $B$ signifies that actors of the same role are more likely to
interact. Conversely, large off-diagonal entries in $B$ indicate that actors of different roles tend
to interact. The MMSB’s expressive power lies in its ability to control the interaction strength
between any pair of roles, by specifying the corresponding entries of $B$.

An Example 

We now provide a simple example to explain how MMSBs generate interactions. Say we have two
social network actors $i$, $j$, with MM vectors:

• $\pi_i$ = [parent = 0.3, worker = 0.7],

• $\pi_j$ = [child = 1].

Let us assume that $i$ is the biological father of $j$, and that the presence or absence of the directed
edge $e_{ij}$ signifies whether $i$ has given orders to $j$. Finally, suppose that the role compatibility
matrix has the following entries:

• $B_{\text{parent},\text{child}} = 0.5$,

• $B_{\text{worker},\text{child}} = 0.01$,

where we ignore the other entries of $B$ as they are irrelevant to this discussion. Intuitively, this $B$
reflects how people acting as parents are likely to order their children to do things, whereas people
acting as (office) workers are unlikely to interact with children at all. Then, the probability of
$e_{ij} = 1$ is computed as:

\[
\begin{aligned}
p(e_{ij} = 1 \mid \pi_i, \pi_j, B)
&= \sum_{z_{\to ij},\, z_{\leftarrow ij}} p(e_{ij} = 1 \mid z_{\to ij}, z_{\leftarrow ij}, B)\, p(z_{\to ij} \mid \pi_i)\, p(z_{\leftarrow ij} \mid \pi_j) \\
&= p(e_{ij} = 1 \mid z_{\to ij} = \text{parent}, z_{\leftarrow ij} = \text{child}, B)\, p(z_{\to ij} = \text{parent} \mid \pi_i)\, p(z_{\leftarrow ij} = \text{child} \mid \pi_j) \\
&\quad + p(e_{ij} = 1 \mid z_{\to ij} = \text{worker}, z_{\leftarrow ij} = \text{child}, B)\, p(z_{\to ij} = \text{worker} \mid \pi_i)\, p(z_{\leftarrow ij} = \text{child} \mid \pi_j) \\
&= (0.5)(0.3)(1) + (0.01)(0.7)(1) \\
&= 0.15 + 0.007 \\
&= 0.157.
\end{aligned}
\]

We see that most of the interaction probability comes from the parent → child relationship, rather
than the worker → child relationship.
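
Summing over both role indicators is the same as evaluating the bilinear form $\pi_i^{\top} B \pi_j$, so the worked example can be checked numerically. The sketch below assumes a hypothetical role ordering (parent, worker, child) and fills the unspecified entries of $B$ with zeros purely as placeholders; they do not affect the result because $\pi_j$ puts all its mass on the child role.

```python
def edge_probability(pi_i, pi_j, B):
    """Return p(e_ij = 1 | pi_i, pi_j, B) = pi_i^T B pi_j."""
    return sum(pi_i[k] * B[k][l] * pi_j[l]
               for k in range(len(pi_i)) for l in range(len(pi_j)))

# Role order (parent, worker, child) is an illustrative assumption.
pi_i = [0.3, 0.7, 0.0]
pi_j = [0.0, 0.0, 1.0]
B = [[0.0, 0.0, 0.5],    # B[parent][child] = 0.5
     [0.0, 0.0, 0.01],   # B[worker][child] = 0.01
     [0.0, 0.0, 0.0]]    # remaining entries are placeholders
p = edge_probability(pi_i, pi_j, B)  # approximately 0.157
```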

23.4.2 Mixture of MMSBs (M³SB)

The actor MM prior $p(\pi)$ significantly affects MMSB’s expressive power. As noted earlier,
the MMSB uses a Dirichlet prior, which is conjugate to the multinomial role indicator
distribution $p(z \mid \pi)$. The advantage of this conjugacy is that one can derive a clean variational
inference algorithm (Airoldi et al., 2008). However, a Dirichlet prior over roles is fairly restrictive in a
statistical sense: it is not multi-modal and cannot capture correlations between roles.

To overcome these shortcomings, we shall extend the MMSB by making $p(\pi)$ a logistic normal
mixture prior, which is both multi-modal (due to the mixture) and permits correlations (due to the
normal distribution). This adds the following generative process over the MM vectors $\pi$:

• $c_i \sim \mathrm{Multinomial}(\delta)$. Mixture component indicator.

• $\gamma_i \sim \mathrm{Normal}(\mu_{c_i}, \Sigma_{c_i})$. Unnormalized MM vector.

• $\pi_i = \mathrm{Logistic}(\gamma_i)$. Logistic-transformed MM vector, where
$[\mathrm{Logistic}(\gamma)]_k = \exp\{\gamma_k\} / \sum_{l=1}^{K} \exp\{\gamma_l\}$.

Combining this generative process over $\pi$ with the MMSB model gives rise to what we call a
mixture of MMSBs (M³SB). Here, $c_i$ is a $C \times 1$ cluster selection indicator for $\pi_i$, where
$C$ is the number of mixture components. Thus, $\pi_i$ is drawn from a logistic normal distribution
with mean and covariance selected by $c_i$, while $c_i$ itself is drawn from a prior multinomial
distribution $\delta$.
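
As a rough illustration of this prior, the sketch below draws a cluster indicator and an MM vector. For brevity it assumes diagonal covariances, passed as per-coordinate standard deviations in `sigmas`; this is a simplification, since a diagonal $\Sigma_{c_i}$ cannot express the role correlations that a full covariance allows (a full $\Sigma_{c_i}$ would need a Cholesky factor). All function and argument names are hypothetical.

```python
import math
import random

def softmax(gamma):
    """Logistic transform: exp(gamma_k) / sum_l exp(gamma_l)."""
    m = max(gamma)                        # subtract the max for stability
    w = [math.exp(g - m) for g in gamma]
    s = sum(w)
    return [x / s for x in w]

def sample_mm_vector(delta, mus, sigmas, rng):
    """Draw (c, pi) from a logistic normal mixture.

    delta:  mixture weights over the C components
    mus:    mus[c] is the K-dimensional mean of component c
    sigmas: sigmas[c] holds per-coordinate standard deviations
            (diagonal covariance, an assumption made for brevity)
    """
    c = rng.choices(range(len(delta)), weights=delta, k=1)[0]
    gamma = [rng.gauss(m, s) for m, s in zip(mus[c], sigmas[c])]
    return c, softmax(gamma)
```

With well-separated component means, repeated draws of `pi` concentrate around several distinct points of the simplex, which is the multi-modality a single Dirichlet cannot provide.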

The M³SB accounts for role correlations using its logistic normal distribution, and has the
flexibility to fit complex data densities by virtue of its multi-modal mixture prior. In the next section, we
shall exploit these properties to design a time-varying network model that tracks the role mixture
trajectories of clusters of actors. This is in contrast to the dMMSB model of Xing et al. (2010),
which tracks a single, average trajectory.

23.4.3 Dynamic M³SB (dM³SB)

In a time-evolving network, we assume that the actor MM vectors $\pi_i^{(t)}$ and their prior
$p^{(t)}(\pi)$ change with time, and the goal is to infer their dynamic trajectories. Inferring the
dynamic actor MM vectors allows us to detect large-scale temporal network trends, particularly groups
of actors whose MM vectors $\pi$ shift from one set of roles to another. For example, if a company
suddenly goes out of business, then its employees will also shift from the “worker” role to the
“unemployed” role.

In order to model time-evolution in the network, we place a state-space model on every logistic
normal distribution in the mixture prior $p(\pi)$, similar to a Kalman filter. Let $N$ denote the number
of actors and $T$ the number of time points in the evolving network. Also, let $K$ denote the number
of MMSB latent roles and $C$ the number of mixture components. We begin with an outline of our
full generative process; see Figure 23.1 for a graphical model representation.



FIGURE 23.1 

Graphical model representation of dM³SB.

1. Mixture State-Space Model for MM Vectors

• $\mu_h^{(1)} \sim \mathrm{Normal}(\nu, \Phi)$ for $h = 1 \ldots C$. Mixture means for the MM prior at $t = 1$.

• $\mu_h^{(t)} \sim \mathrm{Normal}(\mu_h^{(t-1)}, \Phi)$ for $h = 1 \ldots C$, $t = 2 \ldots T$. Mixture means for $t > 1$.

2. Mixture Component Indicators

• $\{c_i^{(t)}\}_{i=1}^{N} \sim \mathrm{Multinomial}(\delta)$ for $t = 1 \ldots T$. Mixture indicator for each MM vector.

3. Mixed Membership Stochastic Blockmodel

• $\{\gamma_i^{(t)}\}_{i=1}^{N} \sim \mathrm{Normal}(\mu^{(t)}_{c_i^{(t)}}, \Sigma_{c_i^{(t)}})$ for $t = 1 \ldots T$.
Unnormalized MM vectors according to the mixture indicated by $c_i^{(t)}$.

• $\pi_i^{(t)} = \mathrm{Logistic}(\gamma_i^{(t)})$, where
$[\mathrm{Logistic}(\gamma)]_k = \exp\{\gamma_k\} / \sum_{l=1}^{K} \exp\{\gamma_l\}$.
Logistic transform of $\gamma_i^{(t)}$ into MM vector $\pi_i^{(t)}$.

• For every actor pair $(i, j \neq i)$ and every time point $t = 1 \ldots T$:

  - $z_{\to ij}^{(t)} \sim \mathrm{Multinomial}(\pi_i^{(t)})$. Actor $i$’s donor role indicator.

  - $z_{\leftarrow ij}^{(t)} \sim \mathrm{Multinomial}(\pi_j^{(t)})$. Actor $j$’s receiver role indicator.

  - $e_{ij}^{(t)} \sim \mathrm{Bernoulli}(z_{\to ij}^{(t)\top} B z_{\leftarrow ij}^{(t)})$. Interaction outcome from actor $i$ to $j$.

We refer to this model as the dynamic mixture of MMSBs (dM³SB for short). The general idea is
to apply the state-space model (SSM) used in object tracking to the MMSB model. Specifically, the
MMSB becomes the emission model to the SSM; a distinct MMSB model is “emitted” at each time
point (Figure 23.1). Furthermore, the SSM contains $C$ distinct trajectories
$\mu_h^{(1)}, \ldots, \mu_h^{(T)}$, each modeling the mean trajectory for a subset of MM vectors
$\pi_i^{(t)}$. The SSM has two parameters $\nu$, $\Phi$, representing the prior mean and variance of
the $C$ trajectories. Each trajectory evolves according to a linear transition model
$\mu_h^{(t)} = A \mu_h^{(t-1)} + w_h^{(t)}$, where $A$ is a transition matrix and
$w_h^{(t)} \sim \mathrm{Normal}(0, \Phi)$ is Gaussian transition noise. We assume $A$ to be the identity
matrix, which corresponds to random walk dynamics; generalization to arbitrary $A$ is straightforward.

Each MM vector $\pi_i^{(t)}$ is then drawn from one of the $C$ trajectories $\mu_h^{(t)}$. The choice
of trajectory for $\pi_i^{(t)}$ is given by the indicator vector $c_i^{(t)}$, which is drawn from some
prior. For simplicity, we have used a single multinomial prior with parameter $\delta$ for all
$c_i^{(t)}$. Observe that $c_i^{(t)}$ can change over time, allowing actors to switch clusters if that
would fit the data better. Given $c_i^{(t)}$, the MM vector $\pi_i^{(t)}$ is drawn according to
$\mathcal{LN}(\mu^{(t)}_{c_i^{(t)}}, \Sigma_{c_i^{(t)}})$, where the variances
$\Sigma_1, \ldots, \Sigma_C$ are model parameters. $\mathcal{LN}$ denotes a logistic normal
distribution, the result of applying a logistic transformation to a normal distribution.

Once $\{\pi_i^{(t)}\}_{i=1}^{N}$ have been drawn for some $t$, the remaining variables
$z_{\to ij}^{(t)}$, $z_{\leftarrow ij}^{(t)}$, $e_{ij}^{(t)}$ follow the MMSB exactly. We assume the
role compatibility $B$ to be a model parameter, although we note that more sophisticated assumptions
can be found in the literature, such as a state-space model prior (Xing et al., 2010).
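
To make the full generative process concrete, the following sketch simulates cluster-mean trajectories, MM vectors, and one network per time point. It is only an illustration under simplifying assumptions that are not part of the model description: the transition matrix is the identity (as in the text), while $\Phi$ and every $\Sigma_h$ are taken to be isotropic with standard deviations `phi_sd` and `sigma_sd`. The function name and its arguments are hypothetical.

```python
import math
import random

def softmax(gamma):
    """Logistic transform of an unnormalized MM vector."""
    m = max(gamma)
    w = [math.exp(g - m) for g in gamma]
    s = sum(w)
    return [x / s for x in w]

def simulate_dm3sb(N, T, B, nu, phi_sd, delta, sigma_sd, seed=0):
    """Simulate cluster means, MM vectors and networks for t = 1..T.

    Assumes A = I (random walk), Phi = phi_sd^2 I and
    Sigma_h = sigma_sd^2 I for every cluster (illustrative choices).
    """
    rng = random.Random(seed)
    C, K = len(delta), len(nu)
    # Cluster means at t = 1 are drawn around the prior mean nu.
    mu = [[rng.gauss(nu[k], phi_sd) for k in range(K)] for _ in range(C)]
    means, pis, nets = [], [], []
    for t in range(T):
        if t > 0:  # random-walk transition mu^(t) = mu^(t-1) + noise
            mu = [[m + rng.gauss(0.0, phi_sd) for m in mu_h] for mu_h in mu]
        # Each actor re-draws its cluster indicator at every time point.
        c = [rng.choices(range(C), weights=delta, k=1)[0] for _ in range(N)]
        pi = [softmax([rng.gauss(m, sigma_sd) for m in mu[c[i]]])
              for i in range(N)]
        E = [[0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue  # self edges are ignored
                k = rng.choices(range(K), weights=pi[i], k=1)[0]  # donor
                l = rng.choices(range(K), weights=pi[j], k=1)[0]  # receiver
                E[i][j] = 1 if rng.random() < B[k][l] else 0
        means.append(mu)
        pis.append(pi)
        nets.append(E)
    return means, pis, nets
```

Note that $B$ is shared across all time points, reflecting the assumption that relationships between roles stay fixed while actor role mixtures evolve.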


23.5 dM³SB Inference and Learning

As with other mixed membership models, neither exact latent variable inference nor parameter
learning is computationally tractable in dM³SB. The mixture prior on $\pi_i^{(t)}$, a factorial hidden
Markov model, presents the biggest difficulty: it is analytically un-integrable, its likelihood is
subject to many local maxima, and it requires exponential time for exact inference. Moreover, its logistic
normal distribution does not admit closed-form integration with the multinomial distribution of
$z \mid \pi$. Finally, the space of possible discrete role indicators $z$ is exponentially large in the
number of actors $N$ and time points $T$.

We address all these difficulties with a variational EM procedure (Ghahramani and Beal, 2001)
based on the generalized mean field (GMF) algorithm (Xing et al., 2003), and using techniques from
Ghahramani and Hinton (2000) and dMMSB (Xing et al., 2010). Our algorithm simultaneously
performs inference and learning for dM³SB in a computationally efficient fashion.

Throughout this section, we shall present just the final dM³SB update equations. For more
thorough derivations, the interested reader is referred to the Appendix.

Briefly, variational inference attempts to approximate the true posterior distribution with
a simpler factored distribution on which inference is computationally more tractable. Let
$\Theta = \{\nu, \Phi, \{\Sigma_h\}_{h=1}^{C}, \delta, B\}$ denote all model parameters. We approximate
the joint posterior
$p(\{z^{(t)}, \gamma^{(t)}, c^{(t)}, \{\mu_h^{(t)}\}_{h=1}^{C}\}_{t=1}^{T} \mid \{\mathcal{E}^{(t)}\}_{t=1}^{T}; \Theta)$
by a variational distribution over factored marginals,

\[
q := q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big) \prod_{t,i=1}^{T,N} \Big[ q_\gamma\big(\gamma_i^{(t)}\big)\, q_c\big(c_i^{(t)}\big) \prod_{j \neq i}^{N} q_z\big(z_{\to ij}^{(t)}, z_{\leftarrow ij}^{(t)}\big) \Big].
\]

The variational factors $q_z$, $q_\gamma$, and $q_c$ are the marginal distributions over the MMSB latent
variables $z$, $\gamma$, and mixture indicators $c$, respectively. The last variational factor $q_\mu$ is
the marginal distribution over the mixture of $C$ SSMs over time. The idea is to approximate latent
variable inference under $p$ (intractable) with feasible inference under $q$. In particular, Ghahramani
and Hinton (2000) have demonstrated that it is feasible to have one marginal $q_\mu$ over all $\mu$s.

The GMF algorithm maximizes a lower bound on the log marginal likelihood
$\log p(\{\mathcal{E}^{(t)}\}_{t=1}^{T}; \Theta)$ over arbitrary choices of $q_z$, $q_\gamma$, $q_c$,
$q_\mu$. We use the GMF solutions to the variational distributions $q$ as the E-step of our variational
EM algorithm, and derive the M-step through direct maximization of our variational lower bound with
respect to $\Theta$. Under GMF, the optimal solution to a marginal $q(X)$ for some latent variable set
$X$ is $p(X \mid Y, \mathbb{E}_q[\phi(\mathrm{MB}_X)])$, the distribution of $X$ conditioned on
the observed variables $Y$ and the expected exponential family sufficient statistics (under variational
distribution $q$) of $X$’s Markov blanket variables (Xing et al., 2003). Hence, our E-step iteratively
computes $q(X) := p(X \mid \{\mathcal{E}^{(t)}\}_{t=1}^{T}, \mathbb{E}_q[\phi(\mathrm{MB}_X)])$ for
$X = \{\mu_h^{(t)}\}_{h,t=1}^{C,T}$, $\gamma_i^{(t)}$, $c_i^{(t)}$, and
$\{z_{\to ij}^{(t)}, z_{\leftarrow ij}^{(t)}\}$.

For brevity, we present only the final E-step equations; exact derivations can be found in the
Appendix.


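E-step for $q_z$:

From here, we drop time indices $t$ whenever appropriate. $q_z$ is a categorical distribution over
$K^2$ elements,

\[
\begin{aligned}
q_z(z_{\to ij} = k, z_{\leftarrow ij} = l) &\sim \mathrm{Multinomial}(\omega_{(ij)}), \\
\omega_{(ij)kl} &\propto (B_{kl})^{e_{ij}} (1 - B_{kl})^{1 - e_{ij}} \exp\big(\langle\gamma_{ik}\rangle + \langle\gamma_{jl}\rangle\big),
\end{aligned}
\tag{23.1}
\]

where $\omega_{(ij)}$ is a normalized $K^2 \times 1$ vector indexed by $(k, l)$.² The notation
$\langle X \rangle$ denotes the expectation of $X$ under $q$; for example, the expectations of $z$ under
$q_z$ are $\langle z_{\to ij,k} \rangle := \sum_l \omega_{(ij)kl}$ and
$\langle z_{\leftarrow ij,l} \rangle := \sum_k \omega_{(ij)kl}$.

² $k$, $l$ correspond to roles indicated by $z_{\to ij}$, $z_{\leftarrow ij}$.

E-step for $q_\gamma$:

$q_\gamma$ does not have a closed form, because the logistic-normal distribution of $\gamma$ is not
conjugate to the multinomial distribution of $z$. We apply a Laplace approximation to $q_\gamma$, making
it normally distributed (Xing et al., 2010; Ahmed and Xing, 2007). Define
$\Psi(a, b, C) := \exp\{-\tfrac{1}{2}(a - b)^{\top} C^{-1} (a - b)\}$. The approximation to $q_\gamma$ is

\[
\begin{aligned}
q_\gamma(\gamma_i) &\propto \Psi(\gamma_i, \tau_i, \Lambda_i), \quad \text{where} \\
\Lambda_i &= \Big( (2N - 2) H_i + \sum_{h=1}^{C} \Sigma_h^{-1} \langle c_{ih} \rangle \Big)^{-1}, \\
\tau_i &= u + \Lambda_i \Big\{ \sum_{j \neq i}^{N} \big( \langle z_{\to ij} \rangle + \langle z_{\leftarrow ji} \rangle \big) - (2N - 2) \big( g_i + H_i (u - \hat{\gamma}_i) \big) \Big\}.
\end{aligned}
\tag{23.2}
\]

Here $u := \big(\sum_{h} \Sigma_h^{-1} \langle c_{ih} \rangle\big)^{-1} \sum_{h} \Sigma_h^{-1} \langle c_{ih} \rangle \langle \mu_h \rangle$
is the precision-weighted average of the cluster means, $\hat{\gamma}_i$ is a Taylor expansion point, and
$g_i$ and $H_i$ are the gradient and Hessian of the log-sum-exp function
$\log\big(\sum_{l=1}^{K} \exp \gamma_{il}\big)$ evaluated at $\gamma_i = \hat{\gamma}_i$. We set
$\hat{\gamma}_i$ to $\langle \gamma_i \rangle$ from the previous E-step iteration, keeping the expansion
point close to the current expectation of $\gamma_i$.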
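
The gradient and Hessian used in the Laplace approximation have simple closed forms: the gradient of log-sum-exp is the softmax vector, and the Hessian is $\mathrm{diag}(g) - g g^{\top}$. The short sketch below computes both at a given expansion point; the function name is hypothetical, and the remaining steps of the update (forming $\Lambda_i$ and $\tau_i$) are omitted because they require matrix inversion.

```python
import math

def logsumexp_grad_hess(gamma):
    """Gradient g and Hessian H of log(sum_l exp(gamma_l)) at gamma."""
    m = max(gamma)                               # stabilize the exponentials
    w = [math.exp(x - m) for x in gamma]
    s = sum(w)
    g = [x / s for x in w]                       # gradient = softmax(gamma)
    K = len(gamma)
    H = [[(g[k] if k == l else 0.0) - g[k] * g[l] for l in range(K)]
         for k in range(K)]                      # Hessian = diag(g) - g g^T
    return g, H
```

Every row of `H` sums to zero, so `H` alone is singular; in the update for $\Lambda_i$ it is the added prior precision term that makes the matrix invertible.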


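E-step for $q_c$:

$q_c$ is discrete over $C$ elements,

\[
q_c(c_i = h) \propto \delta_h\, |\Sigma_h|^{-1/2} \exp\Big\{ -\tfrac{1}{2} \mathrm{tr}\Big[ \Sigma_h^{-1} \big( \langle \gamma_i \gamma_i^{\top} \rangle - \langle \mu_h \rangle \langle \gamma_i \rangle^{\top} - \langle \gamma_i \rangle \langle \mu_h \rangle^{\top} + \langle \mu_h \mu_h^{\top} \rangle \big) \Big] \Big\}.
\]

Note the dependency on second order moments $\langle \gamma_i \gamma_i^{\top} \rangle$ and
$\langle \mu_h \mu_h^{\top} \rangle$. Since $q_\gamma$ and $q_\mu$ are Gaussian, these moments are simple
to compute.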
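
For intuition, the $q_c$ update can be written out for the special case of diagonal $\Sigma_h$, where the trace reduces to a per-coordinate sum. This restriction is an assumption made only to keep the sketch free of matrix algebra. The function and argument names are hypothetical; second moments are formed as variance plus squared mean.

```python
import math

def q_c_update(delta, sigma2, g_mean, g_var, mu_mean, mu_var):
    """Cluster responsibilities q_c(c_i = h) for one actor.

    delta:   prior cluster weights
    sigma2:  sigma2[h][k] is the k-th diagonal entry of Sigma_h
    g_mean, g_var:   mean and variance of gamma_i under q_gamma
    mu_mean, mu_var: per-cluster mean and variance of mu_h under q_mu
    """
    logw = []
    for h in range(len(delta)):
        quad = sum((g_var[k] + g_mean[k] ** 2              # <gamma_k^2>
                    - 2.0 * g_mean[k] * mu_mean[h][k]       # cross terms
                    + mu_var[h][k] + mu_mean[h][k] ** 2)    # <mu_hk^2>
                   / sigma2[h][k] for k in range(len(g_mean)))
        logdet = sum(math.log(v) for v in sigma2[h])        # log |Sigma_h|
        logw.append(math.log(delta[h]) - 0.5 * logdet - 0.5 * quad)
    m = max(logw)                        # normalize in log space
    w = [math.exp(x - m) for x in logw]
    s = sum(w)
    return [x / s for x in w]
```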


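E-step for $q_\mu$:

The GMF solution to $q_\mu$ factors across clusters $h$:

\[
\begin{aligned}
q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big) &:= \prod_{h=1}^{C} q_{\mu,h}\big(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\big), \quad \text{where} \\
q_{\mu,h}\big(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\big) &\propto \Psi\big(\mu_h^{(1)}, \nu, \Phi\big)\, \mathrm{Ob}(1, h) \prod_{t=2}^{T} \Psi\big(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\big)\, \mathrm{Ob}(t, h), \\
\mathrm{Ob}(t, h) &:= \Psi\Big( \mu_h^{(t)},\ \frac{\sum_{i=1}^{N} \langle c_{ih}^{(t)} \rangle \langle \gamma_i^{(t)} \rangle}{\sum_{i=1}^{N} \langle c_{ih}^{(t)} \rangle},\ \frac{\Sigma_h}{\sum_{i=1}^{N} \langle c_{ih}^{(t)} \rangle} \Big).
\end{aligned}
\tag{23.3}
\]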


Notice that factor $q_{\mu,h}$ resembles a state-space model for cluster $h$, with “observation
probability” at time $t$ proportional to $\mathrm{Ob}(t, h)$. Hence the mean and covariance of each
$\mu_h^{(t)}$ can be efficiently computed using the Kalman smoothing algorithm.

Input: Temporal sequence of networks $\{\mathcal{G}^{(t)}\}_{t=1}^{T}$.
Output: Variational distributions $q_z$, $q_\gamma$, $q_c$, $q_\mu$ and model parameters $B$, $\delta$, $\nu$, $\Phi$, $\{\Sigma_h\}_{h=1}^{C}$.
Initialize parameters $B$, $\delta$, $\nu$, $\Phi$, $\{\Sigma_h\}_{h=1}^{C}$.
Sample initial values for $z^{(t)}$, $\gamma^{(t)}$, $c^{(t)}$.
repeat
    repeat
        Update $q_z(z_{\to ij}^{(t)}, z_{\leftarrow ij}^{(t)})$ for all $i$, $j$, $t$.
        Update $B$.
        Update $q_\gamma(\gamma_i^{(t)})$ for all $i$, $t$.
    until convergence
    Update $q_\mu(\mu_1^{(1)}, \ldots, \mu_C^{(T)})$.
    Update $\nu$, $\Phi$.
    Update $q_c(c_i^{(t)})$ for all $i$, $t$.
    Update $\delta$, $\{\Sigma_h\}_{h=1}^{C}$.
until convergence

Algorithm 1: Variational EM for dM³SB.
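
The $q_\mu$ update in Algorithm 1 amounts to Kalman smoothing of one random walk per cluster. As a hedged illustration, the sketch below runs a scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass, which would apply coordinate-wise if $\Phi$ and $\Sigma_h$ were diagonal (an assumption of this sketch, not of the model). Here `y[t]` stands for the responsibility-weighted average of $\langle \gamma_i^{(t)} \rangle$ and `r[t]` for the matching entry of $\Sigma_h$ divided by the summed responsibilities; all names are hypothetical.

```python
def smooth_cluster_mean(y, r, nu, phi):
    """RTS smoother for a scalar random walk with pseudo-observations.

    y[t], r[t]: pseudo-observation mean and variance at time t
    nu, phi:    prior mean at t = 1 and initial/transition variance
    Returns smoothed means and variances of mu^(1), ..., mu^(T).
    """
    T = len(y)
    m, P = [0.0] * T, [0.0] * T
    for t in range(T):                       # forward Kalman filter
        m_pred = nu if t == 0 else m[t - 1]
        P_pred = phi if t == 0 else P[t - 1] + phi
        gain = P_pred / (P_pred + r[t])
        m[t] = m_pred + gain * (y[t] - m_pred)
        P[t] = (1.0 - gain) * P_pred
    ms, Ps = m[:], P[:]
    for t in range(T - 2, -1, -1):           # backward smoothing pass
        P_next = P[t] + phi                  # predicted variance at t + 1
        J = P[t] / P_next                    # smoother gain
        ms[t] = m[t] + J * (ms[t + 1] - m[t])
        Ps[t] = P[t] + J * J * (Ps[t + 1] - P_next)
    return ms, Ps
```

A cluster with few assigned actors at time $t$ has a large `r[t]`, so its smoothed mean at that time leans on neighboring time points rather than on the sparse evidence.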


23.5.1 Parameter Estimation (M-step) 


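Given GMF solutions to each $q$ from our E-step, we take our variational lower bound on the log
marginal likelihood, and maximize it jointly with respect to all parameters $\Theta$ (for details, refer to
the Appendix). Let $S(A) := A + A^{\top}$. The parameter solutions are:

\[
\begin{aligned}
B_{kl} &:= \frac{\sum_{t,i,j \neq i}^{T,N,N} \omega_{(ij)kl}^{(t)}\, e_{ij}^{(t)}}{\sum_{t,i,j \neq i}^{T,N,N} \omega_{(ij)kl}^{(t)}}, \qquad
\nu := \frac{1}{C} \sum_{h=1}^{C} \langle \mu_h^{(1)} \rangle, \\
\Phi &:= \frac{1}{TC} \sum_{h=1}^{C} \Big[ \langle \mu_h^{(1)} \mu_h^{(1)\top} \rangle - S\big( \langle \mu_h^{(1)} \rangle \nu^{\top} \big) + \nu \nu^{\top} \\
&\qquad\qquad + \sum_{t=2}^{T} \Big( \langle \mu_h^{(t)} \mu_h^{(t)\top} \rangle - S\big( \langle \mu_h^{(t)} \mu_h^{(t-1)\top} \rangle \big) + \langle \mu_h^{(t-1)} \mu_h^{(t-1)\top} \rangle \Big) \Big], \\
\delta_h &:= \frac{\sum_{t,i}^{T,N} \langle c_{ih}^{(t)} \rangle}{TN}, \\
\Sigma_h &:= \frac{\sum_{t,i}^{T,N} \langle c_{ih}^{(t)} \rangle \Big[ \langle \gamma_i^{(t)} \gamma_i^{(t)\top} \rangle - S\big( \langle \gamma_i^{(t)} \rangle \langle \mu_h^{(t)} \rangle^{\top} \big) + \langle \mu_h^{(t)} \mu_h^{(t)\top} \rangle \Big]}{\sum_{t,i}^{T,N} \langle c_{ih}^{(t)} \rangle}.
\end{aligned}
\]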


Our full inference and learning algorithm is summarized in Algorithm 1. This algorithm
interleaves the E-step and M-step equations, yielding a coordinate ascent algorithm in the space of
variational and model parameters. The algorithm is guaranteed to converge to a local optimum in
our variational lower bound, and we use multiple random restarts to approach the global optimum.
Similar to the MMSB variational EM algorithm (Airoldi et al., 2008), we update $q_z$, $q_\gamma$, and $B$
more frequently for improved convergence. Note that each random restart can be run on a separate
computational thread, making dM³SB easily parallelizable and therefore highly scalable.
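
The inner loop of the algorithm alternates between the $q_z$ update of Equation (23.1) and the closed-form update of $B$. A minimal sketch of these two steps is given below; it takes the expected unnormalized MM vectors as plain lists, and the function names and the flat list-of-pairs layout are hypothetical conveniences rather than the authors' data structures.

```python
import math

def q_z_update(e_ij, B, g_i, g_j):
    """Return omega[k][l] = q_z(z_{->ij} = k, z_{<-ij} = l) for one pair.

    g_i, g_j are the expected unnormalized MM vectors <gamma_i>, <gamma_j>.
    """
    K = len(B)
    w = [[(B[k][l] if e_ij else 1.0 - B[k][l]) * math.exp(g_i[k] + g_j[l])
          for l in range(K)] for k in range(K)]
    total = sum(map(sum, w))
    return [[x / total for x in row] for row in w]

def update_B(edges, omegas, K):
    """M-step for B: sum(omega * e) / sum(omega) over all pairs and times."""
    num = [[0.0] * K for _ in range(K)]
    den = [[0.0] * K for _ in range(K)]
    for e, om in zip(edges, omegas):
        for k in range(K):
            for l in range(K):
                num[k][l] += om[k][l] * e
                den[k][l] += om[k][l]
    return [[num[k][l] / den[k][l] for l in range(K)] for k in range(K)]
```

Each entry of the updated $B$ is thus a weighted edge frequency, with weights given by how strongly each pair is attributed to the role combination $(k, l)$.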


23.5.2 Variational Inference 


23.5.3 Suitability of the Variational Approximation 

Given that our true model is multi-modal, our variational approximation will only be useful if it
also fits multi-modal data. Historically, naive mean field approximations, such as those used in latent
space models like MMSB (Airoldi et al., 2008) and latent Dirichlet allocation (Blei et al.,
2003), approximate all latent variables with unimodal variational distributions. These unimodal
distributions are unlikely to fit multi-modal densities well; instead, we employ a structured mean field
approximation that approximates all $\mu$s with a single, multi-modal switching state-space
distribution $q_\mu(\cdot)$, essentially a collection of $C$ Kalman filters. This ensures that the multi-modal
structure of the prior on the MM vectors $\pi_i^{(t)}$ is not lost. Moreover, although each
$q_\gamma(\gamma_i^{(t)})$ for a given $i$, $t$ is a unimodal Gaussian, it can be fitted to any mode in
$q_\mu(\cdot)$, independently of $q_\gamma(\gamma_i^{(t)})$ for other $i$, $t$.
This flexibility ensures the variational posterior over all $\gamma_i^{(t)}$ remains multi-modal.


23.6 Validation 

To validate dM³SB, we need to show that it fits multi-modal, correlated, time-varying data better
than alternative models. For this purpose, we shall compare dM³SB to its unimodal predecessor
dMMSB (Xing et al., 2010), and show that it improves over the latter in multiple respects, on
both synthetic and real-world data. Later, we shall conduct a case study on a real-world dataset to
demonstrate dM³SB’s capabilities.

In the experiments that follow, we ran our algorithm for 50 outer loop iterations per random
restart, with 5 iterations per inner loop. We also fixed $\Phi = I_K$ and $\delta = 1/C$ instead of running
their M-steps, as we found this yields more stable results. For the remaining parameters, we used
their M-steps with the following initializations: $B_{kl} \sim \mathrm{Uniform}(0, 1)$, $\Sigma_h = I_K$.
As for $\nu$, we initialized $\langle \mu_h^{(1)} \rangle \sim \mathrm{Uniform}([-1, 1]^K)$ for all $h$
and set $\nu$ to their average. The remaining variational parameters were initialized via the generative
process.

23.6.1 Synthetic Data 

Previously, Xing et al. (2010) compared the performance of the dMMSB time-varying model against 
a naive sequence of disjoint MMSBs, one per network time point. In particular, when the roles 
are correlated, the logistic-normal prior provides a better fit to the data than the Dirichlet prior. 
Moreover, for time-varying networks, dMMSB provides a better fit than disjoint MMSBs on every 
time point. 

We now demonstrate that dM³SB’s multi-modal prior is an even better fit to time-varying
network data than dMMSB’s unimodal prior. In this experiment, we shall compare dM³SB to
dMMSB in terms of model fit (measured by the log marginal likelihood) and actor MM recovery.
We generate data with $N = 200$ actors and $T = 5$ time points, and assume a $K = 3$ role
compatibility matrix $B = (B_1, B_2, B_3)^{\top}$, with rows $B_1 = (1, .25, 0)$, $B_2 = (0, 1, .25)$, and
$B_3 = (0, 0, 1)$. The actors are divided into four groups of 50, with the first three groups having true
MM vectors (.9, .05, .05), (.05, .9, .05) and (.05, .05, .9), respectively, for all time points. The last
group has MM vectors that move over time, according to the sequence $\pi^{(1)} = (.6, .3, .1)$,
$\pi^{(2)} = (.3, .6, .1)$, $\pi^{(3)} = (.1, .8, .1)$, $\pi^{(4)} = (.1, .6, .3)$, $\pi^{(5)} = (.1, .3, .6)$.
The generated $B$, MM vectors $\pi$, and networks are visualized in Figure 23.2.
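
A dataset with this ground truth can be regenerated with a short script. The sketch below encodes the stated $B$ and MM vectors and samples each edge through the MMSB role-picking process; that sampling step, the random seed, and all identifiers are assumptions of this illustration, so the resulting networks will not match the ones shown in Figure 23.2 edge for edge.

```python
import random

B = [[1.0, 0.25, 0.0],
     [0.0, 1.0, 0.25],
     [0.0, 0.0, 1.0]]
STATIC = [(0.9, 0.05, 0.05), (0.05, 0.9, 0.05), (0.05, 0.05, 0.9)]
MOVING = [(0.6, 0.3, 0.1), (0.3, 0.6, 0.1), (0.1, 0.8, 0.1),
          (0.1, 0.6, 0.3), (0.1, 0.3, 0.6)]

def mm_vectors(t, group_size=50):
    """True MM vectors at time index t (0-based) for all four groups."""
    pis = []
    for g in range(3):                       # three static groups
        pis += [STATIC[g]] * group_size
    return pis + [MOVING[t]] * group_size    # one moving group

def sample_network(pis, rng):
    """Sample one directed network via the MMSB generative process."""
    n = len(pis)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:                       # no self edges
                k = rng.choices(range(3), weights=pis[i], k=1)[0]
                l = rng.choices(range(3), weights=pis[j], k=1)[0]
                E[i][j] = 1 if rng.random() < B[k][l] else 0
    return E

rng = random.Random(0)                       # the seed is arbitrary
networks = [sample_network(mm_vectors(t), rng) for t in range(5)]
```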

Thus far, we have not addressed model selection, specifically, selection of the number of roles
$K$ and the number of mixture components (clusters) $C$. To do so, we performed a grid search over
$K \in \{2, 3, 4, 5, 6\}$ and $C \in \{1, 2, 3, 4, 5\}$ on the full network, using 200 random restarts per
$(K, C)$ combination. For all combinations, we observed convergence well within our limit of 50
outer iterations. Furthermore, completing all 200 restarts for each $K$, $C$ took between 8 hours
($K = 2$, $C = 1$) and 28 hours ($K = 6$, $C = 5$) on a single processor. Since the random restarts can
be run in parallel, with sufficient computing power one could easily scale dM³SB to much larger
time-varying networks with thousands of actors and tens of time points.

FIGURE 23.2

Synthetic data ground truth visualization. Top Row: Adjacency matrix visualizations, beginning on
the left with t = 1 using random actor ordering, followed by t = 1, . . . , 5 with actors grouped
according to the ground truth. Bottom left: The role compatibility matrix B, shown as a graph.
Circles represent roles, and numbered arrows represent interaction probabilities. Bottom row: True
actor MM plots in the 3-role simplex for each t. Blue, green and red crosses denote the static MMs
of the first 3 actor groups, and the cyan circle denotes the moving MM of the last actor group.

For each (K, C) from the gridsearch, we selected its best random restart using the variational
lower bound with a Bayesian information criterion (BIC) penalty. The best restart BIC scores are
plotted in Figure 23.3; note that dMMSB corresponds to the special case C = 1. The optimal BIC
score selects the correct number of roles K = 3 and clusters C = 4, making it a good substitute for
held-out model selection.
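The selection rule above can be sketched as follows. This is a minimal illustration, not the procedure's actual code: the sign convention (a penalized lower bound where larger is better), the function names, and in particular the parameter count in `n_free_params` are assumptions made for the example.

```python
import math

def bic_score(lower_bound, n_params, n_obs):
    """Penalized variational lower bound; larger is better (assumed convention)."""
    return lower_bound - 0.5 * n_params * math.log(n_obs)

def n_free_params(K, C):
    # Hypothetical count: role matrix B (K*K), C cluster covariances (K*K each),
    # cluster prior delta (C-1), initial mean nu (K), transition covariance Phi (K*K).
    return K * K + C * K * K + (C - 1) + K + K * K

def select_model(bounds, n_obs):
    """bounds maps (K, C) -> list of lower-bound values, one per random restart.

    Returns (score, K, C) for the pair whose best restart has the highest score.
    """
    best = None
    for (K, C), values in bounds.items():
        score = bic_score(max(values), n_free_params(K, C), n_obs)
        if best is None or score > best[0]:
            best = (score, K, C)
    return best
```

Because each (K, C) pair and each restart is scored independently, the dictionary `bounds` can be filled by parallel workers before `select_model` is called.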



[Figure 23.3 panels: BIC scores over the number of roles K (synthetic data); 5-fold average heldout log-likelihood for dMMSB (K = 3) and dM³SB (K = 3, C = 4).]

FIGURE 23.3

Synthetic data: BIC scores and 5-fold heldout log-likelihoods for dM³SB and dMMSB.


Next, using the BIC-optimal (K, C), we ran dM³SB on a 5-fold heldout experiment. In each
fold, we randomly partitioned the dataset’s actors into two equal sets, and used the two corresponding
subnetworks as training and test data. In each training fold, we selected the best model parameters
Θ from 100 random restarts using the variational lower bound. We then estimated the
log marginal likelihood for these parameters on the corresponding test fold, using Monte Carlo
integration with 2000 samples. This process was repeated for all 5 folds to get an average log
marginal likelihood for dM³SB. For comparison, we conducted the same heldout experiment for
a dMMSB set to K from the optimal (K, C) pair. The average log marginal likelihood for both
methods is shown in Figure 23.3, and we see that dM³SB’s greater heldout likelihood makes it a
better statistical fit to this synthetic dataset than dMMSB.
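A Monte Carlo estimate of a log marginal likelihood is usually computed in log space to avoid underflow. The sketch below shows that step only; the sampling scheme (drawing latent variables from the model prior and averaging the conditional likelihood) and the names `log_lik_fn` and `sample_latent_fn` are assumptions for illustration, not a description of the experiment's actual sampler.

```python
import numpy as np

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    a = np.asarray(log_values, dtype=float)
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def heldout_log_marginal(log_lik_fn, sample_latent_fn, n_samples=2000, seed=0):
    """Estimate log p(E_test | Theta) as log (1/S) sum_s p(E_test | X_s, Theta).

    sample_latent_fn(rng) draws latent variables X_s (hypothetical callback);
    log_lik_fn(X_s) returns log p(E_test | X_s, Theta) (hypothetical callback).
    """
    rng = np.random.default_rng(seed)
    logs = [log_lik_fn(sample_latent_fn(rng)) for _ in range(n_samples)]
    return log_mean_exp(logs)
```

Averaging the resulting estimate over the 5 folds gives the quantity compared between the two models.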

Finally, we compared dM³SB to dMMSB in role estimation (B) and actor role recovery ($\pi_i^{(t)}$),
using their best restarts on the correct (K, C) (or just K for dMMSB). Table 23.1 shows, for both
methods versus the ground truth, the average ℓ2 error in $\pi_i^{(t)}$ (specifically, we compared the ground
truth to the posterior mean of $\pi_i^{(t)}$ from either method) as well as the total variation in B. dM³SB’s
average ℓ2 error in $\pi_i^{(t)}$ is significantly lower than dMMSB’s, at the cost of a higher total variation
in B. However, dM³SB’s total variation of 0.1083 implies an average difference of only 0.012 in
each of the nine entries of B, which is already quite accurate. The fact that dM³SB accurately
recovers $\pi_i^{(t)}$ confirms that its posterior over all $\pi_i^{(t)}$ is multi-modal, which validates our variational
approximation.
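The two accuracy measures can be written down in a few lines. The sketch assumes that "total variation" here means the sum of absolute entrywise differences, which is the reading consistent with 0.1083 spread over nine entries giving about 0.012 per entry; it also assumes the estimated roles have already been aligned with the ground-truth role labels.

```python
import numpy as np

def total_variation(B_hat, B_true):
    """Sum of absolute entrywise differences between two role matrices (assumed definition)."""
    return float(np.abs(np.asarray(B_hat, dtype=float) - np.asarray(B_true, dtype=float)).sum())

def mean_l2_error(pi_hat, pi_true):
    """Mean Euclidean distance between estimated and true MM vectors (last axis = roles)."""
    diff = np.asarray(pi_hat, dtype=float) - np.asarray(pi_true, dtype=float)
    return float(np.linalg.norm(diff, axis=-1).mean())
```

Dividing `total_variation` by the number of entries of B gives the per-entry average quoted in the text.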


TABLE 23.1

Synthetic data: Estimation accuracy of dM³SB (K = 3, C = 4) and dMMSB (K = 3).

Method    Role matrix B, total variation    MMs $\pi_i^{(t)}$, mean ℓ2 difference
dM³SB     0.1083                            0.0266
dMMSB     0.0135                            0.0477


We also note that dM³SB’s mean cluster trajectories $\mu_h^{(t)}$ accurately estimated the four groups’
mean MM vectors with a maximum ℓ2 error of 0.0761 for any group h and time t, except at t = 5,
where dM³SB exchanged group 3’s trajectory with that of (moving) group 4. In conclusion, we have
seen that dM³SB provides a better fit to this synthetic dataset than dMMSB, thanks to the former’s
multi-modal prior.

23.6.2 Real Data 

We now assess the model fitness of both dM³SB and dMMSB on two real-world datasets: a 151-actor
subset of the Enron email communications dataset (Shetty and Adibi, 2004) over the 12 months
of 2001, and a 100-actor subset of the United States Congress voting data over the 8 quarters of
2005 and 2006 (described in the next section). As with the synthetic data, we shall use heldout
log-likelihood to measure how well each model fits the data.

For both datasets, we first selected the optimal values of (K, C) via BIC score gridsearch with
dM³SB over K ∈ {3, 4, 5, 6}, C ∈ {2, 3, 4, 5}. Our previous synthetic experiment has demonstrated
that model gridsearch using BIC produces good results. The optimal values were K = 4, C = 2 for
the Senator dataset, and K = 3, C = 4 for the Enron dataset (Figure 23.4).

Using each dataset’s optimal (K, C), we next ran dM³SB on the 5-fold heldout experiment
discussed in the previous section, obtaining average log marginal likelihoods. For comparison, we
conducted the same heldout experiments for dMMSB set to K from the optimal (K, C) pair.

Plots of the heldout log marginal likelihoods for dM³SB and dMMSB can be found in Figure
23.4. On the Senator dataset, dM³SB has the higher log marginal likelihood, implying that it is a
better statistical fit than dMMSB. For the Enron dataset, both methods have the same likelihood,
showing that using dM³SB with more mixture components at least incurs no statistical cost over
FIGURE 23.4

Senator/Enron data: BIC scores and 5-fold heldout log-likelihoods for dM³SB and dMMSB.


dMMSB. These results demonstrate that dM³SB’s multi-modal prior is a better fit to some real-world,
time-varying networks, compared to dMMSB’s unimodal prior.


23.7 Case Study: U. S. Congress Voting Data 

We finish our discussion with an application of dM³SB to the United States 109th Congress voting
records. Here, we will show that dM³SB not only recovers MM vectors and a role compatibility
matrix that match our intuitive expectations of the data, but that the MM vectors are useful for
identifying outliers and other unusual phenomena.

The 109th Congress involved 100 senators and 542 bills spread over the dates January 1, 2005
through December 31, 2006. The original voting data (see footnote 3) is provided in the form of yes/no votes for
each senator and each bill. In order to create a time-varying network suitable for dM³SB, we applied
the method of Kolar et al. (2008) to recreate their network result.

The generated time-varying network contains 100 actors (senators), and 8 time points corresponding
to 3-month epochs starting on January 1, 2005 and ending on December 31, 2006. The
network is an undirected graph, where an edge between two senators indicates that their votes were
mostly similar during that particular epoch. Conversely, a missing edge indicates that their votes
were mostly different. Our intention is to discover how the political allegiances of different senators
shifted from 2005 to 2006.

3 Available at http://www.senate.gov.

For our analysis, we used the optimal dM³SB restart from the BIC gridsearch described in
the previous held-out experiment. Recall that this optimal restart uses K = 4 roles and C = 2
clusters. The learned MM vectors $\pi_i^{(t)}$, compatibility matrix B, and most probable cluster assignments
are summarized in Figure 23.5. The results are intuitive: Democratic party members have a high
proportion of Role 1, while Republican party members have a high proportion of Role 2. Both Roles
1 and 2 interact exclusively with themselves, reflecting the tendency of both political parties to vote
with their comrades and against the other party. The remaining two roles exhibit no interactions;
senators with high proportions of these roles are unaligned and unlikely to vote with either political
party. Observe that the two clusters closely track party affiliations: Republican senators are
almost always in cluster 1, while Democratic senators are almost always in cluster 2.

While it is reassuring to see results that reflect a contemporary understanding of U.S. politics,
a more useful application of dM³SB’s mixed membership analysis is in identifying outliers. For
instance, consider the Democratic Senator Ben Nelson (#75): from t = 1 through 7, his votes were
unaligned with either Democrats or Republicans, though his votes were gradually shifting towards
Republican. At t = 8 (the end of 2006), his voting becomes strongly Republican (Role 2), and he
shifts from the Democrat cluster (1) to the Republican one (2). Sen. Nelson’s trajectory through the
role simplex is plotted in Figure 23.6. Incidentally, Sen. Nelson was re-elected as the Senator from
Nebraska in late 2006, winning a considerable percentage of his state’s Republican vote.

Next, observe how the senator from New Jersey, #28, started off unaligned from t = 1 to 4 
but ended up Democratic from t = 5 to 8; his role trajectory is also plotted in Figure 23.6. There 
is an interesting reason for this: the seat for New Jersey was occupied by two senators during the 
Congress, Senator Jon Corzine in the first session (t = 1 to 4), and Senator Bob Menendez in the 
second session (t = 5 to 8). Sen. Corzine was known to have far-left views, reflected in #28’s lack 
of both Republican and Democratic roles during his term (the Democrat role captures mainstream 
rather than extremist voting behavior). Once Sen. Menendez took over, #28’s behavior fell in line 
with most Democrats. 

Other notable outliers include Senator James Jeffords (#54), the sole Independent senator who
votes like a Democrat, and three Republican senators with Democratic leanings: Senator Lincoln
Chafee (#19), Senator Susan Collins (#25), and Senator Olympia Snowe (#89). These senators exhibit
MM vectors that deviate significantly from their party average, which makes them obvious outliers
under even a simple K-means clustering. Through examining these outliers, dM³SB allows us to perform
anomaly detection and analysis.
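One simple way to surface such outliers from the learned MM vectors is to measure each senator's distance from the party average. The sketch below is an illustration only: the time-averaging, the ℓ2 distance, the `threshold` parameter, and the function name are assumptions, not the analysis used to produce the observations above.

```python
import numpy as np

def party_outliers(mm, party, threshold):
    """Flag actors whose time-averaged MM vector is far from their party's mean.

    mm: array of shape (N, T, K) holding MM vectors per actor and time point.
    party: length-N sequence of party labels.
    threshold: l2 distance above which an actor is flagged (assumed tuning knob).
    """
    mm = np.asarray(mm, dtype=float)
    party = np.asarray(party)
    avg = mm.mean(axis=1)                      # (N, K) time-averaged MM vectors
    flagged = []
    for p in np.unique(party):
        idx = np.flatnonzero(party == p)
        center = avg[idx].mean(axis=0)         # party mean MM vector
        dist = np.linalg.norm(avg[idx] - center, axis=1)
        flagged.extend(int(i) for i in idx[dist > threshold])
    return sorted(flagged)
```

A party with a single member (such as the lone Independent) always has distance zero under this rule, so such cases need to be compared against the other parties' means instead.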

In summary, dM³SB provides a latent space view of the 109th Congress voting network, which
reveals both expected aggregate trends (voting along party lines) as well as unexpected anomalies
(senators who differ from their party norm). We anticipate that dM³SB can also be applied
to understanding time-evolving biological networks, just as Xing et al. (2010) applied the earlier
dMMSB model to such data.

[Figure 23.5 panels. Legend: each senator is shown with an evolving mixed membership vector (8 time points), official affiliation (D = Democrat, R = Republican, I = Independent), state abbreviation (e.g., HI = Hawaii), and cluster at each time point (1 = "Democrat" cluster, 2 = "Republican" cluster). Role compatibility matrix with roles labeled Democrat behavior, Republican behavior, Centrist behavior, Centrist behavior.]

FIGURE 23.5 

Congress voting network: Mixed membership vectors (colored bars) and most probable cluster as- 
signments (numbers under bars) for all 100 senators, displayed as an 8-time-point series from left- 
to-right. The annotation beside a senator’s number refers to that senator’s political party (D for 
Democrat, R for Republican, I for Independent) and state (as a two-letter abbreviation). Refer to the 
legend for specific details. The learned role compatibility matrix is displayed at the top right. 
FIGURE 23.6

Congress voting network 3-simplex visualizations. Colors (green, blue) denote cluster membership.
Left: MM vector time-trajectory for Senator #28 (D-NJ): Jon Corzine during time points 1-4 and
Bob Menendez during time points 5-8. Right: MM vector time-trajectory for Senator Ben Nelson
(#75, D-NE).

23.8 Conclusion

dM³SB is a probabilistic model for latent role analysis in time-varying networks, with an efficient
variational EM algorithm for approximate inference and learning. This model is distinguished by
its explicit modeling of actor multi-functionalities (role MMs), as well as its multi-modal, time-evolving,
logistic normal mixture prior over these multi-functionalities, which allows dM³SB to fit
complex latent role densities. We also note that dM³SB’s variational inference algorithm is trivial
to run in parallel, since each random restart can be run on a separate computational thread.

Notably, dM³SB is an evolution of the dMMSB (Xing et al., 2010) and MMSB (Airoldi et al.,
2008) models, and shares much in common with them. Validation experiments show that dM³SB’s
multi-modal prior outperforms the unimodal prior of dMMSB on both synthetic and real data,
which underscores the importance of using statistically flexible priors. The most important uses
of dM³SB are exploration of actor latent roles and anomaly detection, which were demonstrated in
a case study on the 109th U.S. Congress voting data.


Appendix 

Derivation of the Variational EM Algorithm 

This appendix provides detailed derivations of the dM³SB variational EM algorithm. Recall that our
goal is to find the posterior distribution of the latent variables μ, c, γ, z given the observed network
sequence $E^{(1)}, \ldots, E^{(T)}$, under the maximum likelihood model parameters B, δ, ν, Φ, and Σ.

Finding the posterior (inference) or solving for the maximum likelihood parameters (learning) 
are both intractable under our original model. Hence we resort to a variational EM algorithm, which 
locally optimizes the model parameters with respect to a lower bound on the true marginal log- 
likelihood, while simultaneously finding a variational distribution that approximates the latent vari- 
able posterior. The marginal log-likelihood lower bound being optimized is 


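$$
\begin{aligned}
\log p(E \mid \Theta) &= \log \int_X p(E, X \mid \Theta)\, dX \\
&= \log \int_X q(X)\, \frac{p(E, X \mid \Theta)}{q(X)}\, dX \\
&\ge \int_X q(X) \log \frac{p(E, X \mid \Theta)}{q(X)}\, dX \qquad \text{(Jensen’s inequality)} \\
&= \mathbb{E}_q\left[\log p(E, X \mid \Theta) - \log q(X)\right] =: \mathcal{L}(q, \Theta),
\end{aligned}
$$

where X denotes the latent variables {μ, c, γ, z}, Θ denotes the model parameters {B, δ, ν, Φ, Σ},
and q is the variational distribution. This lower bound is iteratively maximized with respect to q’s
parameters (E-step) and the model parameters Θ (M-step).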
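The bound can be checked numerically on a toy model with a single discrete latent variable, where the evidence is a finite sum. The sketch below is only such a sanity check (the three-state joint table is made up for illustration); it shows that any q gives a value no larger than the log evidence, with equality at the exact posterior.

```python
import numpy as np

def elbo(joint, q):
    """E_q[log p(E, X) - log q(X)] for a discrete latent X.

    joint[x] = p(E, X = x | Theta) for the observed E; q is any distribution over X.
    States with q[x] = 0 contribute nothing to the expectation.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * (np.log(joint[mask]) - np.log(q[mask]))))
```

For example, with `joint = [0.10, 0.25, 0.05]` the log evidence is `log(0.40)`; a uniform q gives a strictly smaller value, while `q = joint / joint.sum()` attains it.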

In principle, the lower bound $\mathcal{L}(q, \Theta)$ holds for any distribution q; ideally q should closely
approximate the true posterior $p(X \mid E, \Theta)$. In the next section, we define a factored form for q and
derive its optimal solution.

Variational Distribution q 


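We assume a factorized form for q:

$$
q = q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big) \prod_{t,i=1}^{T,N} \Big[ q_\gamma\big(\gamma_i^{(t)}\big)\, q_c\big(c_i^{(t)}\big) \prod_{j \ne i}^{N} q_z\big(z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)}\big) \Big].
$$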

We now make use of the generalized mean field (GMF) theory (Xing et al., 2003) to determine
each factor’s form. GMF theory optimizes a lower bound on the marginal distribution
$p(E \mid \Theta)$ over arbitrary choices of $q_\mu$, $q_\gamma$, $q_c$, and $q_z$. In particular, the optimal solution to $q_X$ is
$p(X \mid E, \mathbb{E}_q[\phi(\mathrm{MB}_X)])$, the distribution of the latent variable set X conditioned on the observed
variables E and the expected exponential family sufficient statistics (under q) of X’s Markov blanket
variables. More precisely, $q_X$ has the same functional form as $p(X \mid E, \mathrm{MB}_X)$, but where a
variational parameter V replaces $\phi(Y)$ for each $Y \in \mathrm{MB}_X$, with optimal solution $V := \mathbb{E}_q[\phi(Y)]$.
In general, if $Y \in \mathrm{MB}_X$, then we use $\langle \phi(Y) \rangle$ to denote the variational parameter corresponding to
Y.

We begin by deriving optimal solutions to $q_\mu$, $q_\gamma$, $q_c$, and $q_z$ in terms of the variational parameters
$\langle \phi(Y) \rangle$. After we have derived all factors, we present closed-form solutions to $\langle \phi(Y) \rangle$. These
solutions form a set of fixed-point equations which, when iterated, converge to a local optimum in
the space of variational parameters (thus completing the E-step).


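Distribution of $q_z$

$q_z$ is a discrete distribution since the zs are indicator vectors. We begin by deriving the distribution
of the zs conditioned on their Markov blanket:

$$
p\big(z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)} \mid \mathrm{MB}\big) \propto p\big(E_{i,j}^{(t)} \mid z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)}, B\big)\, p\big(z_{i \to j}^{(t)} \mid \gamma_i^{(t)}\big)\, p\big(z_{i \leftarrow j}^{(t)} \mid \gamma_j^{(t)}\big).
$$

The variables $\gamma_i^{(t)}, \gamma_j^{(t)}$ belong to other variational factors, and their exponential family sufficient
statistics are just $\gamma_i^{(t)}$ and $\gamma_j^{(t)}$ themselves. Hence

$$
q_z\big(z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)}\big) \propto \exp\Big\{ E_{i,j}^{(t)} \log\big(z_{i \to j}^{(t)\top} B\, z_{i \leftarrow j}^{(t)}\big) + \big(1 - E_{i,j}^{(t)}\big) \log\big(1 - z_{i \to j}^{(t)\top} B\, z_{i \leftarrow j}^{(t)}\big) + z_{i \to j}^{(t)\top} \big\langle \gamma_i^{(t)} \big\rangle + z_{i \leftarrow j}^{(t)\top} \big\langle \gamma_j^{(t)} \big\rangle \Big\}
$$

with variational parameters $\langle \gamma_i^{(t)} \rangle$ and $\langle \gamma_j^{(t)} \rangle$. We can also express $q_z$ in terms of indices k, l:

$$
q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big) \propto \exp\Big\{ E_{i,j}^{(t)} \log B_{k,l} + \big(1 - E_{i,j}^{(t)}\big) \log\big(1 - B_{k,l}\big) + \big\langle \gamma_{i,k}^{(t)} \big\rangle + \big\langle \gamma_{j,l}^{(t)} \big\rangle \Big\}.
$$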
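The indexed form of $q_z$ is a softmax over the K × K table of role pairs. The sketch below computes it for one actor pair at one time point; the function name and argument layout are assumptions for illustration, and B is assumed to have entries strictly between 0 and 1.

```python
import numpy as np

def qz_update(e_ij, B, gamma_i, gamma_j):
    """Return the K x K table q[k, l] = q_z(z_{i->j} = k, z_{i<-j} = l).

    e_ij: observed edge indicator (0 or 1) between actors i and j.
    B: (K, K) role compatibility matrix with entries in (0, 1).
    gamma_i, gamma_j: (K,) variational means <gamma_i>, <gamma_j>.
    """
    B = np.asarray(B, dtype=float)
    gamma_i = np.asarray(gamma_i, dtype=float)
    gamma_j = np.asarray(gamma_j, dtype=float)
    logits = (e_ij * np.log(B) + (1 - e_ij) * np.log1p(-B)
              + gamma_i[:, None] + gamma_j[None, :])
    logits -= logits.max()                 # stabilize before exponentiating
    q = np.exp(logits)
    return q / q.sum()
```

Summing the table over l gives the marginal for $z_{i \to j}$ and summing over k gives the marginal for $z_{i \leftarrow j}$, which are the expectations consumed by the other factors.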


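Distribution of $q_\gamma$

$q_\gamma$ is a continuous distribution. The distribution of $\gamma_i^{(t)}$ conditioned on its Markov blanket is

$$
\begin{aligned}
p\big(\gamma_i^{(t)} \mid \mathrm{MB}_{\gamma_i^{(t)}}\big)
&\propto p\big(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\big) \prod_{j \ne i}^{N} p\big(z_{i \to j}^{(t)} \mid \gamma_i^{(t)}\big)\, p\big(z_{j \leftarrow i}^{(t)} \mid \gamma_i^{(t)}\big) \\
&\propto \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big)^{\top} \Sigma_h^{-1} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big) \Big\}
\prod_{j \ne i}^{N} \prod_{k=1}^{K} \left( \frac{\exp \gamma_{i,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}} \right)^{z_{i \to j,k}^{(t)}} \left( \frac{\exp \gamma_{i,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}} \right)^{z_{j \leftarrow i,k}^{(t)}} \\
&= \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big)^{\top} \Sigma_h^{-1} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big)
+ \sum_{j \ne i}^{N} \sum_{k=1}^{K} \big( z_{i \to j,k}^{(t)} \gamma_{i,k}^{(t)} + z_{j \leftarrow i,k}^{(t)} \gamma_{i,k}^{(t)} \big)
- (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \Big\} \\
&\propto \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)} \Big[ \gamma_i^{(t)\top} \Sigma_h^{-1} \gamma_i^{(t)} - \gamma_i^{(t)\top} \Sigma_h^{-1} \mu_h^{(t)} - \mu_h^{(t)\top} \Sigma_h^{-1} \gamma_i^{(t)} \Big]
+ \sum_{j \ne i}^{N} \big( z_{i \to j}^{(t)} + z_{j \leftarrow i}^{(t)} \big)^{\top} \gamma_i^{(t)}
- (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \Big\}.
\end{aligned}
$$

The variables $c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}, z_{i \to 1}^{(t)}, \ldots, z_{i \to N}^{(t)}, z_{1 \leftarrow i}^{(t)}, \ldots, z_{N \leftarrow i}^{(t)}$ belong to other variational factors.
The sufficient statistics for variables z are just $z_{i \to j}^{(t)}$ and $z_{j \leftarrow i}^{(t)}$ themselves. For variables c and
μ, their sufficient statistics are $c_{i,h}^{(t)}$ and $c_{i,h}^{(t)} \mu_h^{(t)}$. However, since c is marginally independent of
μ under q, we can take their expectations independently, hence the variational parameters are just
$\langle c_{i,h}^{(t)} \rangle$ and $\langle \mu_h^{(t)} \rangle$. Hence

$$
q_\gamma\big(\gamma_i^{(t)}\big) \propto \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} \big\langle c_{i,h}^{(t)} \big\rangle \Big[ \gamma_i^{(t)\top} \Sigma_h^{-1} \gamma_i^{(t)} - \gamma_i^{(t)\top} \Sigma_h^{-1} \big\langle \mu_h^{(t)} \big\rangle - \big\langle \mu_h^{(t)} \big\rangle^{\top} \Sigma_h^{-1} \gamma_i^{(t)} \Big]
+ \sum_{j \ne i}^{N} \Big( \big\langle z_{i \to j}^{(t)} \big\rangle + \big\langle z_{j \leftarrow i}^{(t)} \big\rangle \Big)^{\top} \gamma_i^{(t)}
- (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \Big\},
$$

with variational parameters $\langle c_i^{(t)} \rangle$, $\langle \mu_h^{(t)} \rangle$, $\langle z_{i \to j}^{(t)} \rangle$, $\langle z_{j \leftarrow i}^{(t)} \rangle$.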


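Laplace Approximation to $q_\gamma$

The term $Z_\gamma(\gamma_i^{(t)}) := \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}$ makes the exponent analytically un-integrable, which
prevents us from computing the normalizer for $q_\gamma(\gamma_i^{(t)})$. Thus, we approximate $Z_\gamma$ with its
second-order Taylor expansion around a chosen point $\hat{\gamma}_i^{(t)}$:

$$
\begin{aligned}
Z_\gamma\big(\gamma_i^{(t)}\big) &\approx Z_\gamma\big(\hat{\gamma}_i^{(t)}\big) + \big(g_i^{(t)}\big)^{\top} \big(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\big) + \tfrac{1}{2} \big(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\big)^{\top} H_i^{(t)} \big(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\big), \qquad (23.4) \\
g_{i,k}^{(t)} &:= \frac{\exp \hat{\gamma}_{i,k}^{(t)}}{\sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)}}, \\
H_{i,kl}^{(t)} &:= \frac{\mathbb{I}[k = l] \exp \hat{\gamma}_{i,k}^{(t)}}{\sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)}} - \frac{\exp \hat{\gamma}_{i,k}^{(t)} \exp \hat{\gamma}_{i,l}^{(t)}}{\big( \sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)} \big)^{2}}.
\end{aligned}
$$

Note that $H_i^{(t)} = \mathrm{diag}\big(g_i^{(t)}\big) - g_i^{(t)} g_i^{(t)\top}$. Because the variational EM algorithm is iterative, we
set $\hat{\gamma}_i^{(t)}$ to $\langle \gamma_i^{(t)} \rangle := \mathbb{E}_q[\gamma_i^{(t)}]$ from the previous iteration, which should keep the point of expansion
close to $\mathbb{E}_q[\gamma_i^{(t)}]$ for the current iteration. The point of this Taylor expansion is to approximate $q_\gamma$
with a normal distribution. In the remainder of this section we write γ, $\hat{\gamma}$, g, H, $\langle c_h \rangle$, $\langle \mu_h \rangle$ for
$\gamma_i^{(t)}$, $\hat{\gamma}_i^{(t)}$, $g_i^{(t)}$, $H_i^{(t)}$, $\langle c_{i,h}^{(t)} \rangle$, $\langle \mu_h^{(t)} \rangle$, and $\langle z_j \rangle$ for $\langle z_{i \to j}^{(t)} \rangle + \langle z_{j \leftarrow i}^{(t)} \rangle$. Consider the exponent of $q_\gamma$,

$$
\begin{aligned}
&\sum_{h=1}^{C} -\tfrac{1}{2} \langle c_h \rangle \Big[ \gamma^{\top} \Sigma_h^{-1} \gamma - \gamma^{\top} \Sigma_h^{-1} \langle \mu_h \rangle - \langle \mu_h \rangle^{\top} \Sigma_h^{-1} \gamma \Big] + \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} \gamma - (2N - 2)\, Z_\gamma(\gamma) \\
&\quad = \mathrm{const}^{(1)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u) + \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} \gamma - (2N - 2)\, Z_\gamma(\gamma),
\end{aligned}
$$

where $\mathrm{const}^{(1)}$ denotes a constant independent of γ, $S := \sum_{h=1}^{C} \Sigma_h^{-1} \langle c_h \rangle$, and $u :=
S^{-1} \sum_{h=1}^{C} \Sigma_h^{-1} \langle c_h \rangle \langle \mu_h \rangle$. Applying the Taylor expansion in Equation (23.4) gives

$$
\begin{aligned}
&\approx \mathrm{const}^{(1)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u) + \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} \gamma
- (2N - 2) \Big\{ Z_\gamma(\hat{\gamma}) + g^{\top} (\gamma - \hat{\gamma}) + \tfrac{1}{2} (\gamma - \hat{\gamma})^{\top} H (\gamma - \hat{\gamma}) \Big\} \\
&= \mathrm{const}^{(2)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u) + \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} \gamma
- (2N - 2) \Big\{ g^{\top} \gamma + \tfrac{1}{2} \gamma^{\top} H \gamma - \hat{\gamma}^{\top} H \gamma \Big\} \\
&= \mathrm{const}^{(2)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u)
+ \Big\{ \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} - (2N - 2) \big( g^{\top} - \hat{\gamma}^{\top} H \big) \Big\} \gamma
- (N - 1)\, \gamma^{\top} H \gamma.
\end{aligned}
$$

Define $A := \sum_{j \ne i}^{N} \langle z_j \rangle^{\top} - (2N - 2) \big( g^{\top} - \hat{\gamma}^{\top} H \big)$ and $B :=
-(N - 1) H$ (in this section only, B is this matrix rather than the role compatibility matrix), so that we obtain

$$
\begin{aligned}
&= \mathrm{const}^{(2)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u) + A \gamma + \gamma^{\top} B \gamma \\
&= \mathrm{const}^{(2)} - \tfrac{1}{2} (\gamma - u)^{\top} S (\gamma - u) + A (\gamma - u + u) + (\gamma - u + u)^{\top} B (\gamma - u + u) \\
&= \mathrm{const}^{(3)} - \tfrac{1}{2} (\gamma - u)^{\top} (S - 2B) (\gamma - u) + \big( A + 2 u^{\top} B \big) (\gamma - u).
\end{aligned}
$$

Finally, define $D := A + 2 u^{\top} B$ and $E := S - 2B$ (again a local symbol, not the network E), resulting in

$$
\begin{aligned}
&= \mathrm{const}^{(3)} - \tfrac{1}{2} (\gamma - u)^{\top} E (\gamma - u) + D (\gamma - u) \\
&= \mathrm{const}^{(3)} - \tfrac{1}{2} (\gamma - u)^{\top} E (\gamma - u) + \big( E^{-1} D^{\top} \big)^{\top} E (\gamma - u) \\
&= \mathrm{const}^{(4)} - \tfrac{1}{2} \big( \gamma - u - E^{-1} D^{\top} \big)^{\top} E \big( \gamma - u - E^{-1} D^{\top} \big).
\end{aligned}
$$

Hence $q_\gamma(\gamma_i^{(t)})$ is approximately $\mathrm{Normal}\big(\tau_i^{(t)}, \Lambda_i^{(t)}\big)$ with covariance and mean

$$
\begin{aligned}
\Lambda_i^{(t)} &:= E^{-1} = \Big( \sum_{h=1}^{C} \Sigma_h^{-1} \big\langle c_{i,h}^{(t)} \big\rangle + (2N - 2) H_i^{(t)} \Big)^{-1}, \\
\tau_i^{(t)} &:= u + E^{-1} D^{\top} = u + \Lambda_i^{(t)} \Big\{ \sum_{j \ne i}^{N} \Big( \big\langle z_{i \to j}^{(t)} \big\rangle + \big\langle z_{j \leftarrow i}^{(t)} \big\rangle \Big) - (2N - 2) \Big[ g_i^{(t)} + H_i^{(t)} \big( u - \hat{\gamma}_i^{(t)} \big) \Big] \Big\}, \\
u &:= \Big( \sum_{h=1}^{C} \Sigma_h^{-1} \big\langle c_{i,h}^{(t)} \big\rangle \Big)^{-1} \Big( \sum_{h=1}^{C} \Sigma_h^{-1} \big\langle c_{i,h}^{(t)} \big\rangle \big\langle \mu_h^{(t)} \big\rangle \Big).
\end{aligned}
$$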
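The Laplace step reduces to a few matrix operations per actor and time point. The sketch below computes the gradient g and Hessian H of the log-sum-exp term and then the Gaussian mean and covariance from the closing formulas of this section; the function names and the choice to pass the precomputed quantities S, u, and the summed z-expectations as arguments are assumptions for illustration.

```python
import numpy as np

def softmax_grad_hess(gamma_hat):
    """Gradient g and Hessian H of log-sum-exp at gamma_hat: H = diag(g) - g g^T."""
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    g = np.exp(gamma_hat - gamma_hat.max())   # shift for numerical stability
    g /= g.sum()
    H = np.diag(g) - np.outer(g, g)
    return g, H

def laplace_gaussian(gamma_hat, S, u, z_sum, N):
    """Mean tau and covariance Lam of the normal approximation to q_gamma.

    S: sum_h Sigma_h^{-1} <c_h>, shape (K, K).
    u: S^{-1} sum_h Sigma_h^{-1} <c_h> <mu_h>, shape (K,).
    z_sum: sum over j != i of (<z_{i->j}> + <z_{j<-i}>), shape (K,).
    N: number of actors.
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    g, H = softmax_grad_hess(gamma_hat)
    Lam = np.linalg.inv(S + (2 * N - 2) * H)
    tau = u + Lam @ (z_sum - (2 * N - 2) * (g + H @ (u - gamma_hat)))
    return tau, Lam
```

As a sanity check on the formulas, setting N = 1 removes every interaction term, and the approximation collapses to the mixture-prior Gaussian with mean u and covariance $S^{-1}$.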
Distribution of $q_c$

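$q_c$ is a discrete distribution. The distribution of $c_i^{(t)}$ conditioned on its Markov blanket is

$$
\begin{aligned}
p\big(c_i^{(t)} \mid \mathrm{MB}_{c_i^{(t)}}\big)
&\propto p\big(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\big)\, p\big(c_i^{(t)}\big) \\
&\propto \prod_{h=1}^{C} \Big[ |\Sigma_h|^{-1/2} \exp\Big\{ -\tfrac{1}{2} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big)^{\top} \Sigma_h^{-1} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big) \Big\} \Big]^{c_{i,h}^{(t)}} \prod_{h=1}^{C} \delta_h^{\,c_{i,h}^{(t)}} \\
&= \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big)^{\top} \Sigma_h^{-1} \big(\gamma_i^{(t)} - \mu_h^{(t)}\big) + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \Big\} \\
&= \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)}\, \mathrm{tr}\Big[ \Sigma_h^{-1} \Big( \gamma_i^{(t)} \gamma_i^{(t)\top} - \mu_h^{(t)} \gamma_i^{(t)\top} - \gamma_i^{(t)} \mu_h^{(t)\top} + \mu_h^{(t)} \mu_h^{(t)\top} \Big) \Big] + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \Big\}.
\end{aligned}
$$

The variables $\gamma_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}$ belong to other variational factors. The sufficient statistics
of γ and μ are $\gamma_i^{(t)} \gamma_i^{(t)\top}$, $\gamma_i^{(t)}$, $\mu_h^{(t)}$, and $\mu_h^{(t)} \mu_h^{(t)\top}$, but since γ and μ are marginally
independent under q, we can take their expectations separately. Hence

$$
q_c\big(c_i^{(t)}\big) \propto \exp\Big\{ \sum_{h=1}^{C} -\tfrac{1}{2} c_{i,h}^{(t)}\, \mathrm{tr}\Big[ \Sigma_h^{-1} \Big( \big\langle \gamma_i^{(t)} \gamma_i^{(t)\top} \big\rangle - \big\langle \mu_h^{(t)} \big\rangle \big\langle \gamma_i^{(t)} \big\rangle^{\top} - \big\langle \gamma_i^{(t)} \big\rangle \big\langle \mu_h^{(t)} \big\rangle^{\top} + \big\langle \mu_h^{(t)} \mu_h^{(t)\top} \big\rangle \Big) \Big] + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \Big\}
$$

with variational parameters $\langle \mu_h^{(t)} \rangle$, $\langle \mu_h^{(t)} \mu_h^{(t)\top} \rangle$, $\langle \gamma_i^{(t)} \rangle$, $\langle \gamma_i^{(t)} \gamma_i^{(t)\top} \rangle$. We can also express
$q_c$ in terms of indices h:

$$
q_c\big(c_i^{(t)} = h\big) \propto \exp\Big\{ -\tfrac{1}{2}\, \mathrm{tr}\Big[ \Sigma_h^{-1} \Big( \big\langle \gamma_i^{(t)} \gamma_i^{(t)\top} \big\rangle - \big\langle \mu_h^{(t)} \big\rangle \big\langle \gamma_i^{(t)} \big\rangle^{\top} - \big\langle \gamma_i^{(t)} \big\rangle \big\langle \mu_h^{(t)} \big\rangle^{\top} + \big\langle \mu_h^{(t)} \mu_h^{(t)\top} \big\rangle \Big) \Big] + \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \Big\}.
$$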
Distribution of $q_\mu$


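$q_\mu$ is a continuous distribution. The distribution of $\mu_1^{(1)}, \ldots, \mu_C^{(T)}$ conditioned on its Markov blanket
is

$$
p\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)} \mid \mathrm{MB}\big) \propto \Big[ \prod_{t=1}^{T} \prod_{i=1}^{N} p\big(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\big) \Big] \Big[ \prod_{h=1}^{C} p\big(\mu_h^{(1)}\big) \prod_{t=2}^{T} p\big(\mu_h^{(t)} \mid \mu_h^{(t-1)}\big) \Big].
$$

The variables $\gamma_1^{(1)}, \ldots, \gamma_N^{(T)}, c_1^{(1)}, \ldots, c_N^{(T)}$ belong to other variational factors. The sufficient statistic
of γ and c is $c_{i,h}^{(t)} \gamma_i^{(t)}$, but since γ and c are marginally independent under q, we can take
their expectations separately. Hence

$$
q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big) \propto \exp\Big\{ \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{h=1}^{C} -\tfrac{1}{2} \big\langle c_{i,h}^{(t)} \big\rangle \big( \big\langle \gamma_i^{(t)} \big\rangle - \mu_h^{(t)} \big)^{\top} \Sigma_h^{-1} \big( \big\langle \gamma_i^{(t)} \big\rangle - \mu_h^{(t)} \big)
+ \sum_{h=1}^{C} \Big[ -\tfrac{1}{2} \big(\mu_h^{(1)} - \nu\big)^{\top} \Phi^{-1} \big(\mu_h^{(1)} - \nu\big) + \sum_{t=2}^{T} -\tfrac{1}{2} \big(\mu_h^{(t)} - \mu_h^{(t-1)}\big)^{\top} \Phi^{-1} \big(\mu_h^{(t)} - \mu_h^{(t-1)}\big) \Big] \Big\}
$$

with variational parameters $\langle \gamma_i^{(t)} \rangle$, $\langle c_i^{(t)} \rangle$.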
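Kalman Smoother for $q_\mu$

We can apply the Kalman smoother to compute the mean and covariance of each $\mu_h^{(t)}$ under $q_\mu$. Let
$\Psi(a, b, C) := \exp\big\{ -\tfrac{1}{2} (a - b)^{\top} C^{-1} (a - b) \big\}$, then with some manipulation we obtain

$$
\begin{aligned}
q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big)
&\propto \prod_{h=1}^{C} \Big[ \Psi\big(\mu_h^{(1)}, \nu, \Phi\big) \prod_{t=2}^{T} \Psi\big(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\big) \prod_{t=1}^{T} \prod_{i=1}^{N} \Psi\big( \big\langle \gamma_i^{(t)} \big\rangle, \mu_h^{(t)}, \Sigma_h \big)^{\langle c_{i,h}^{(t)} \rangle} \Big] \\
&\propto \prod_{h=1}^{C} \Big[ \Psi\big(\mu_h^{(1)}, \nu, \Phi\big) \prod_{t=2}^{T} \Psi\big(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\big) \prod_{t=1}^{T} \Psi\Big( \frac{\sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle \langle \gamma_i^{(t)} \rangle}{\sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle},\; \mu_h^{(t)},\; \frac{\Sigma_h}{\sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle} \Big) \Big].
\end{aligned}
$$

Notice that $q_\mu$ factorizes across cluster indices h:

$$
q_\mu\big(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\big) = \prod_{h=1}^{C} q_{\mu_h}\big(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\big).
$$

Observe that each factor $q_{\mu_h}\big(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\big)$ is a linear system of the form

$$
\begin{aligned}
\mu_h^{(t+1)} &= \mu_h^{(t)} + w_h^{(t)}, \\
\alpha_h^{(t)} &= \mu_h^{(t)} + v_h^{(t)},
\end{aligned}
$$

where $\mu_h^{(t)}$ are latent variables and $\alpha_h^{(t)}$ are observed variables with value
$\alpha_h^{(t)} = \sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle \langle \gamma_i^{(t)} \rangle \big/ \sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle$.
Furthermore, $w_h^{(t)} \sim \mathcal{N}(0, \Phi)$, $v_h^{(t)} \sim \mathcal{N}\big(0, \Xi_h^{(t)}\big)$ with
$\Xi_h^{(t)} = \Sigma_h \big/ \sum_{i=1}^{N} \langle c_{i,h}^{(t)} \rangle$, and $\mu_h^{(1)} \sim \mathcal{N}(\nu, \Phi)$.

Hence the distribution of each $\mu_h^{(t)}$ under $q_\mu$ is Gaussian, and its mean and covariance can be computed
using the Kalman smoother equations

$$
\begin{aligned}
\hat{\mu}_h^{(t+1)|(t)} &= \hat{\mu}_h^{(t)|(t)}, \\
P_h^{(t+1)|(t)} &= P_h^{(t)|(t)} + \Phi, \\
K_h^{(t+1)} &= P_h^{(t+1)|(t)} \big( P_h^{(t+1)|(t)} + \Xi_h^{(t+1)} \big)^{-1}, \\
\hat{\mu}_h^{(t+1)|(t+1)} &= \hat{\mu}_h^{(t+1)|(t)} + K_h^{(t+1)} \big( \alpha_h^{(t+1)} - \hat{\mu}_h^{(t+1)|(t)} \big), \\
P_h^{(t+1)|(t+1)} &= \big( I - K_h^{(t+1)} \big) P_h^{(t+1)|(t)},
\end{aligned}
$$

and

$$
\begin{aligned}
L_h^{(t)} &= P_h^{(t)|(t)} \big( P_h^{(t+1)|(t)} \big)^{-1}, \\
\hat{\mu}_h^{(t)|(T)} &= \hat{\mu}_h^{(t)|(t)} + L_h^{(t)} \big( \hat{\mu}_h^{(t+1)|(T)} - \hat{\mu}_h^{(t+1)|(t)} \big), \\
P_h^{(t)|(T)} &= P_h^{(t)|(t)} + L_h^{(t)} \big( P_h^{(t+1)|(T)} - P_h^{(t+1)|(t)} \big) L_h^{(t)\top}.
\end{aligned}
$$

Thus, $\mu_h^{(t)}$ has mean $\hat{\mu}_h^{(t)|(T)}$ and covariance $P_h^{(t)|(T)}$ under $q_\mu$.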
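The recursions above are the standard Kalman filter followed by a backward (Rauch-Tung-Striebel) smoothing pass, specialized to identity transition and observation matrices. The sketch below runs them for one cluster h; the initialization of the first prediction from the prior Normal(ν, Φ), the argument names, and the array layout are assumptions for illustration.

```python
import numpy as np

def kalman_smoother(alpha, Xi, Phi, nu):
    """Smoothed means and covariances for a random-walk state with direct observations.

    alpha: (T, K) pseudo-observations; Xi: (T, K, K) observation covariances;
    Phi: (K, K) transition covariance; nu: (K,) prior mean of the first state.
    Returns (mu_s, P_s) with shapes (T, K) and (T, K, K).
    """
    T, K = alpha.shape
    I = np.eye(K)
    mu_pred = np.empty((T, K)); P_pred = np.empty((T, K, K))
    mu_filt = np.empty((T, K)); P_filt = np.empty((T, K, K))
    for t in range(T):
        if t == 0:
            mu_pred[t], P_pred[t] = nu, Phi              # assumed start: prior N(nu, Phi)
        else:
            mu_pred[t], P_pred[t] = mu_filt[t - 1], P_filt[t - 1] + Phi
        gain = P_pred[t] @ np.linalg.inv(P_pred[t] + Xi[t])
        mu_filt[t] = mu_pred[t] + gain @ (alpha[t] - mu_pred[t])
        P_filt[t] = (I - gain) @ P_pred[t]
    mu_s = mu_filt.copy(); P_s = P_filt.copy()
    for t in range(T - 2, -1, -1):                       # backward smoothing pass
        L = P_filt[t] @ np.linalg.inv(P_pred[t + 1])
        mu_s[t] = mu_filt[t] + L @ (mu_s[t + 1] - mu_pred[t + 1])
        P_s[t] = P_filt[t] + L @ (P_s[t + 1] - P_pred[t + 1]) @ L.T
    return mu_s, P_s
```

Since $q_\mu$ factorizes across clusters, this routine can be called once per h with that cluster's pseudo-observations and covariances.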


E-Step: Solutions to Variational Parameters 

In the E-step, we find locally optimal variational parameters for each factor of q. The solutions to
the continuous parameters are



while the solutions to the discrete parameters are 



k — 1 


These solutions are used to update the variational parameters in each factor of q. Note that they 
form a set of fixed-point equations that converge to a local optimum in the space of variational 
parameters. Hence the E-step involves iterating these equations until some convergence threshold 
has been reached. 
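For reference, the expectations that these fixed-point updates must supply follow from the factor forms derived above. The block below is a sketch under those forms (Gaussian $q_\gamma$ with mean τ and covariance Λ, Gaussian $q_\mu$ with Kalman-smoothed moments, and normalized discrete factors $q_c$ and $q_z$); it is a reconstruction from standard Gaussian and categorical moment identities, not a transcription of the chapter's own display.

```latex
\begin{aligned}
\big\langle \gamma_i^{(t)} \big\rangle &= \tau_i^{(t)}, &
\big\langle \gamma_i^{(t)} \gamma_i^{(t)\top} \big\rangle &= \Lambda_i^{(t)} + \tau_i^{(t)} \tau_i^{(t)\top}, \\
\big\langle \mu_h^{(t)} \big\rangle &= \hat{\mu}_h^{(t)|(T)}, &
\big\langle \mu_h^{(t)} \mu_h^{(t)\top} \big\rangle &= P_h^{(t)|(T)} + \hat{\mu}_h^{(t)|(T)} \hat{\mu}_h^{(t)|(T)\top}, \\
\big\langle c_{i,h}^{(t)} \big\rangle &= q_c\big(c_i^{(t)} = h\big), \\
\big\langle z_{i \to j,k}^{(t)} \big\rangle &= \sum_{l=1}^{K} q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big), &
\big\langle z_{i \leftarrow j,l}^{(t)} \big\rangle &= \sum_{k=1}^{K} q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big).
\end{aligned}
```

Here $q_c$ and $q_z$ denote the normalized versions of the proportional forms given in their respective sections.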


M-Step 

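In the M-step, we maximize $\mathcal{L}(q, \Theta)$ with respect to the model parameters Θ = {B, Σ, δ, ν, Φ}.
Recall that

$$
\mathcal{L}(q, \Theta) := \mathbb{E}_q\left[ \log p(E, X \mid \Theta) - \log q(X) \right].
$$

Note that the variational distribution q is not actually a function of the model parameters Θ; the
model parameters that appear in q’s optimal solution come from the previous M-step, similar to
regular EM. Hence it suffices to maximize

$$
\begin{aligned}
\mathcal{L}'(q, \Theta) := \mathbb{E}_q\left[ \log p(E, X \mid \Theta) \right]
&= \mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \log p\big(E_{i,j}^{(t)} \mid z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)}, B\big) \Big]
+ \mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \log p\big(z_{i \to j}^{(t)} \mid \gamma_i^{(t)}\big)\, p\big(z_{i \leftarrow j}^{(t)} \mid \gamma_j^{(t)}\big) \Big] \\
&\quad + \mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \log p\big(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}, \Sigma_1, \ldots, \Sigma_C\big) \Big]
+ \mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \log p\big(c_i^{(t)} \mid \delta\big) \Big] \\
&\quad + \mathbb{E}_q\Big[ \sum_{h=1}^{C} \Big( \log p\big(\mu_h^{(1)} \mid \nu, \Phi\big) + \sum_{t=2}^{T} \log p\big(\mu_h^{(t)} \mid \mu_h^{(t-1)}, \Phi\big) \Big) \Big].
\end{aligned}
$$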


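Maximizing B

Consider the B-dependent terms in $\mathcal{L}'(q, \Theta)$,

$$
\mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \log p\big(E_{i,j}^{(t)} \mid z_{i \to j}^{(t)}, z_{i \leftarrow j}^{(t)}, B\big) \Big]
= \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \mathbb{E}_q\Big[ E_{i,j}^{(t)} \log\big(z_{i \to j}^{(t)\top} B\, z_{i \leftarrow j}^{(t)}\big) + \big(1 - E_{i,j}^{(t)}\big) \log\big(1 - z_{i \to j}^{(t)\top} B\, z_{i \leftarrow j}^{(t)}\big) \Big]
$$

(the zs are independent of other latent variables under q).

Since $z_{i \to j}^{(t)}$, $z_{i \leftarrow j}^{(t)}$ are indicator variables, we index their possible values with k ∈ {1, ..., K} and
l ∈ {1, ..., K}, respectively:

$$
= \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \sum_{k,l=1}^{K,K} q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big) \Big[ E_{i,j}^{(t)} \log B_{k,l} + \big(1 - E_{i,j}^{(t)}\big) \log\big(1 - B_{k,l}\big) \Big]. \qquad (23.5)
$$

Setting the first derivative wrt $B_{k,l}$ to zero yields the maximizer $\hat{B}_{k,l}$ for $\mathcal{L}'(q, \Theta)$:

$$
\begin{aligned}
0 &= \frac{\partial}{\partial B_{k,l}} \sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} \sum_{k',l'=1}^{K,K} q_z\big(z_{i \to j}^{(t)} = k',\, z_{i \leftarrow j}^{(t)} = l'\big) \Big[ E_{i,j}^{(t)} \log B_{k',l'} + \big(1 - E_{i,j}^{(t)}\big) \log\big(1 - B_{k',l'}\big) \Big], \\
\hat{B}_{k,l} &= \frac{\sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big)\, E_{i,j}^{(t)}}{\sum_{t,i=1}^{T,N} \sum_{j \ne i}^{N} q_z\big(z_{i \to j}^{(t)} = k,\, z_{i \leftarrow j}^{(t)} = l\big)}.
\end{aligned}
$$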
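Setting the derivative of the weighted Bernoulli log-likelihood in Equation (23.5) to zero gives a weighted edge frequency for each role pair. The sketch below implements that update; it is the standard maximizer for an objective of this form rather than a transcription of the chapter's display, and the array layout and function name are assumptions for illustration.

```python
import numpy as np

def update_B(E, q):
    """Weighted-frequency update for the role compatibility matrix.

    E: (T, N, N) binary adjacency matrices.
    q: (T, N, N, K, K) with q[t, i, j, k, l] = q_z(z_{i->j} = k, z_{i<-j} = l).
    Returns a (K, K) matrix; entries are undefined if a role pair has zero total weight.
    """
    E = np.asarray(E, dtype=float)
    q = np.asarray(q, dtype=float)
    N = E.shape[1]
    mask = ~np.eye(N, dtype=bool)                  # exclude self-pairs j == i
    w = q * mask[None, :, :, None, None]
    num = np.einsum('tij,tijkl->kl', E, w)         # responsibility-weighted edge counts
    den = w.sum(axis=(0, 1, 2))                    # total responsibility per role pair
    return num / den
```

With uniform responsibilities, every entry of the result equals the overall edge density among distinct actor pairs, which is a quick way to check an implementation.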
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Maximizing E 

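Consider the $\Sigma_1, \ldots, \Sigma_C$-dependent terms in $\mathcal{L}'(q, \Theta)$. Since c, γ are independent of each other
(and other latent variables) under q,

$$
\mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \log p\big(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}, \Sigma_1, \ldots, \Sigma_C\big) \Big]
= \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} \big\langle c_{i,h}^{(t)} \big\rangle \Big[ -\log\big( (2\pi)^{K/2} |\Sigma_h|^{1/2} \big) - \tfrac{1}{2}\, \mathrm{tr}\big( \Sigma_h^{-1} M_{i,h}^{(t)} \big) \Big], \qquad (23.6)
$$

$$
M_{i,h}^{(t)} := \big\langle \gamma_i^{(t)} \gamma_i^{(t)\top} \big\rangle - \big\langle \mu_h^{(t)} \big\rangle \big\langle \gamma_i^{(t)} \big\rangle^{\top} - \big\langle \gamma_i^{(t)} \big\rangle \big\langle \mu_h^{(t)} \big\rangle^{\top} + \big\langle \mu_h^{(t)} \mu_h^{(t)\top} \big\rangle,
$$

where we have defined $\langle X \rangle := \mathbb{E}_q[X]$, and the solutions to $\langle X \rangle$ are identical to the E-step. Setting
the first derivative wrt $\Sigma_h$ to zero yields the maximizer $\hat{\Sigma}_h$ for $\mathcal{L}'(q, \Theta)$:

$$
\begin{aligned}
0 &= \nabla_{\Sigma_h} \sum_{t,i=1}^{T,N} \sum_{h'=1}^{C} \big\langle c_{i,h'}^{(t)} \big\rangle \Big[ -\log\big( (2\pi)^{K/2} |\Sigma_{h'}|^{1/2} \big) - \tfrac{1}{2}\, \mathrm{tr}\big( \Sigma_{h'}^{-1} M_{i,h'}^{(t)} \big) \Big], \\
\hat{\Sigma}_h &= \frac{\sum_{t,i=1}^{T,N} \big\langle c_{i,h}^{(t)} \big\rangle M_{i,h}^{(t)}}{\sum_{t,i=1}^{T,N} \big\langle c_{i,h}^{(t)} \big\rangle}.
\end{aligned}
$$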
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Maximizing 5 

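Consider the δ-dependent terms in $\mathcal{L}'(q, \Theta)$,

$$
\mathbb{E}_q\Big[ \sum_{t,i=1}^{T,N} \log p\big(c_i^{(t)} \mid \delta\big) \Big] = \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} \big\langle c_{i,h}^{(t)} \big\rangle \log \delta_h, \qquad (23.7)
$$

where $\langle c_{i,h}^{(t)} \rangle := \mathbb{E}_q[c_{i,h}^{(t)}]$, and the solution to $\langle c_{i,h}^{(t)} \rangle$ is identical to the E-step. Taking the first
derivative with respect to each $\delta_h$ under the constraint $\sum_{h} \delta_h = 1$, setting all the derivatives to zero,
and performing some manipulation, we obtain the maximizer $\hat{\delta}$ for $\mathcal{L}'(q, \Theta)$:

$$
\hat{\delta}_h = \frac{1}{TN} \sum_{t,i=1}^{T,N} \big\langle c_{i,h}^{(t)} \big\rangle.
$$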
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Maximizing v. <I> 


Consider the ν, Φ-dependent terms in $\mathcal{L}'(q, \Theta)$,
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We begin by maximizing wrt i/, which only requires us to focus on the first term: 
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where ;= J , and the solution to ([ ■* [) is identical to the E-step, 

derivative wrt ν to zero yields the maximizer $\hat{\nu}$ for $\mathcal{L}'(q, \Theta)$:
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We now substitute v = v and consider the ^-dependent terms in C (g, 0): 
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where (X) := [X], The solutions to (AA) ^(kk (AA) 2> (A n ) \ 

tical to the E-step. The remaining expectations are 
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where P and L are defined in the section discussing the Kalman smoother. Setting the first derivative 
wrt Φ to zero yields the maximizer $\hat{\Phi}$ for $\mathcal{L}'(q, \Theta)$:
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Computing the Variational Lower Bound £ (q. 0) 

The marginal likelihood lower bound $\mathcal{L}(q, \Theta)$ can be used to test for convergence in the variational
EM algorithm. It also functions as a surrogate for the true marginal likelihood $p(E \mid \Theta)$; this is
useful when taking random restarts, as it enables us to select the highest likelihood restart. Recall
that

It turns out that we cannot compute $\mathcal{L}(q, \Theta)$ exactly because of term 2, but we can lower-bound
the latter to produce a lower bound $\mathcal{L}_{\mathrm{lower}}(q, \Theta)$ on $\mathcal{L}(q, \Theta)$.

Closed forms for terms 1, 3, 4, and 5 are in Equations (23.5), (23.6), (23.7), and (23.8), respectively.
We now provide closed forms for terms 6, 7, 8, and 9, as well as the aforementioned lower bound for
term 2.

Lower Bound for Term 2 
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where A* is defined in the previous section discussing the Laplace approximation. 
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Term 6 

Define

where $\alpha_h^{(t)}$, $\Xi_h^{(t)}$ are from the Kalman smoother. Also note our abuse of notation: ν, Φ refer to the
values used to compute $q_\mu$ in the E-step (see Kalman smoother section), and not
their current values (recall that $q_\mu$ is not a function of ν, Φ). Now define

and where $L_h^{(t)}$ are from the Kalman smoother section.

Using definitions from the previous section,

where $\Lambda_i^{(t)}$ is from the Laplace approximation section.

Term 8 


Term 8 is trivial to compute since q c is discrete: 
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Term 9 


Term 9 is also trivial to compute since q z is discrete: 
Real-world networks are inherently complex dynamical systems, where both node attributes and
network topology change in time. These changes often affect each other, providing complex feedback
mechanisms between node and link dynamics. Here we propose a dynamic mixed membership
model of networks that explicitly takes into account such feedback. In the proposed model, the probability
of observing a link between two nodes depends on their current group membership vectors,
while those membership vectors themselves evolve in the presence of a link between the nodes. 
Thus, the network is shaped by the interaction of stochastic processes describing the nodes, while 
the processes themselves are influenced by the changing network structure. We derive an efficient 
variational procedure for inference, and validate the model using both synthetic and real-world data. 
24.1 Introduction

Networks are a useful paradigm for representing various social, biological, and technological sys- 
tems. Modeling the structure and formation of networks is made more difficult when the nodes in 
the network and the topology of the network change over time. The growth of the internet and so- 
cial media, in particular, has provided researchers with huge amounts of data that make such studies 
both feasible and highly desirable. 

A standard approach to network modeling assumes a generative model for links based on node 
attributes. That is, the nodes or objects modeled are assumed to have some (possibly latent) at- 
tributes, e.g., group membership, and these latent properties determine the formation of links be- 
tween nodes. A version of this approach which has achieved great success is the mixed membership 
stochastic blockmodel (MMSB) (Airoldi et al., 2008). MMSBs recognize that nodes often have mul- 
tiple attributes (mixed membership) that may come into play when determining whether two nodes 
should be linked. Thus, MMSBs are a special case of a more general class of latent space models, 
which assume that nodes’ attributes are described in some abstract space, and the formation of links 
between nodes depends on the distance between their attributes in that space (Hoff et al., 2002;
Krioukov et al., 2009). In MMSBs each actor is characterized by a probability distribution over his
attributes, so the corresponding latent space is a simplex (Airoldi, 2007; Blei and Fienberg, 2007). 

A common limitation of these approaches is that the attributes of nodes are assumed to be un- 
changing over time. If the nodes represent people, for instance, we know that attributes like interests, 
location, or job may change over time and this may affect a person’s connections to the network. 
In this case, it is necessary to model the dynamics of the nodes’ hidden attributes as well. Despite 
recent progress in modeling time-varying networks (Fu et al., 2009; Ho et al., 2011; Kolar et al.,
2010; Kolar and Xing, 2011; Xing et al., 2010), there are still some open problems. In particular,
the existing models so far have neglected the possibility that the change in a node’s attributes at one 
time step may depend on the network structure at previous time steps. The network structure, on 
the other hand, depends on the nodes’ attributes, thus resulting in a feedback loop between node
dynamics and network evolution. 

A concrete example of this phenomenon occurs in social networks. For instance, it is known that
new friendship links are often formed as a result of selection effects like homophily: actors often
befriend people with similar interests (Snijders et al., 2006). In turn, social actors introduce their
friends to new ideas and interests in a process known as social influence or diffusion. Together, these 
dynamics cause both the nodes and the network structure to evolve simultaneously. 

Our contribution is to combine a model of node dynamics that depends on network topology 
with an MMSB-inspired generative model for link formation that depends on changing node at- 
tributes. We use this model to describe the co-evolution of selection and influence for real-world 
dynamic network data. The rest of the chapter is structured as follows. We begin with a high-level 
description of dynamic networks and how we can adapt MMSBs to describe them, followed by a 
discussion of related work. In Section 24.2, we describe the details of our co-evolving mixed mem- 
bership stochastic blockmodel (CMMSB), including a discussion of how to efficiently infer model 
parameters. In Section 24.3, we apply a CMMSB to a synthetic dataset and a real-world dataset 
consisting of the bill co-sponsorship network among U.S. senators. A discussion of results follows 
in Section 24.4. We provide detailed calculations in the Appendix. 

24.1.1 Selection and Influence in Networks 

Suppose we have $N$ nodes and we observe a network structure among them at discrete time steps,
$t = 0, 1, \ldots, T$. If there exists a directed link from node $p$ to node $q$ at time $t$, we say $Y_t(p, q) = 1$,
otherwise 0. There are many examples of real-world data that fit this format, including friendship
ties in a social network and gene regulatory networks. 

We suppose that the nodes themselves are described by some hidden attribute that changes over 
time, i.e., node $p$ is described at time $t$ by a vector $\mu_p^t$. For a social network, this vector could represent
interests, group membership, or behavioral traits, while in a gene regulatory network this could
indicate response to stages of a cell cycle. Then by selection we mean that the probability of a link
between two nodes depends on their attribute vectors:

$$\mathrm{Prob}(Y_t(p, q) = 1) = g(\mu_p^t, \mu_q^t). \tag{24.1}$$

One of the most famous forms of selection is homophily, or assortative mixing, which states that
nodes tend to interact with other nodes that have similar attributes. We stress, however, that different
selection mechanisms are possible as well, e.g., disassortative mixing patterns such as buyer-seller
relationships.

The next step is to explicitly model the dynamics in the latent space. For instance, $\mu$ may drift
over time, or perhaps it responds to either one-time or recurring external events. As discussed in 
the introduction, we are particularly interested in modeling the influence of a node’s neighbors on 
his/her dynamics. Toward this end, we allow a feedback mechanism where an interaction at one 
time step affects the position of the node at the next time step. That is, we want to model dynamics 
of the form 

$$\mu_p^{t+1} = f\big(\mu_p^t, \{\mu_q^t : q \in S_p^t\}\big), \tag{24.2}$$

where $S_p^t$ denotes the neighbors of node $p$ at time $t$. For instance, to model positive social influence
one should select a function $f$ such that the distance between nodes contracts after the interaction.
It is possible to have more general (e.g., repulsive) interactions as well, depending on the concrete 
scenario. 

Together, Equations (24.1) and (24.2) provide a very high-level description of our approach. We 
would like to emphasize that while distance-based interactions (such as given by Equation (24.1)) 
are at the core of most prior work, introducing a feedback mechanism via the influence model as 
in Equation (24.2) is one of the main ideas distinguishing our approach from a previous attempt to
formulate dynamic MMSBs in Fu et al. (2009). 

Once we have specified a model for node dynamics, the task of fixing a model for link formation 
remains. Ideally, a generative model for link formation based on the node dynamics should capture 
our intuitions about real link formation while admitting some uncertainty and allowing efficient 
inference. For these reasons, we chose to adapt MMSBs, which we describe in the next section. 

24.1.2 Mixed Membership Stochastic Blockmodels 

In this chapter we will use a latent space representation of the nodes based on MMSBs (Airoldi et al.,
2008). In this section, we purposely adhere to a high-level description of MMSBs and their
dynamic extensions; we discuss a detailed implementation in Section 24.2. Starting
with a static MMSB, each node has a normalized mixed membership vector $\pi_p \in \mathbb{R}^K$,
which describes the probability for node $p$ to take one of $K$ roles. The role that a node takes in a
particular interaction is sampled according to the membership vector, and the probability of a link
between $p$ and $q$ then depends on the roles they take and the role compatibility matrix, $B$. The generative
process is as follows:


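$$
\begin{aligned}
\pi_p &\sim \text{Prior distribution} \\
z_{p \to q} &\sim \text{Multinomial}(\pi_p) \\
z_{p \leftarrow q} &\sim \text{Multinomial}(\pi_q) \\
Y(p, q) &\sim \text{Bernoulli}\big(z_{p \to q}^{\top} B\, z_{p \leftarrow q}\big)
\end{aligned}
$$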
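The static generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_static_mmsb`, the Dirichlet prior on the membership vectors, and the numerical values of $B$ are choices made for the example and are not taken from the chapter.

```python
import numpy as np

def sample_static_mmsb(pi, B, rng):
    """Sample a directed adjacency matrix from a static MMSB.

    pi : (N, K) array, each row a membership vector summing to 1
    B  : (K, K) role compatibility matrix with entries in [0, 1]
    """
    N, K = pi.shape
    Y = np.zeros((N, N), dtype=int)
    for p in range(N):
        for q in range(N):
            if p == q:
                continue  # assumption: no self-links in this sketch
            g = rng.choice(K, p=pi[p])  # role taken by p toward q (z_{p->q})
            h = rng.choice(K, p=pi[q])  # role taken by q toward p (z_{p<-q})
            Y[p, q] = rng.binomial(1, B[g, h])
    return Y

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(3) * 0.3, size=20)  # assumed Dirichlet prior
B = 0.05 + 0.85 * np.eye(3)                    # assumed assortative B
Y = sample_static_mmsb(pi, B, rng)
```

Because the roles are redrawn for every ordered pair, a node with a spread-out membership vector can act in different roles toward different partners, which is the "mixed" aspect of the model.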





The most naive dynamic extension is to simply add a t index to all the variables in the previous 
expression. This amounts to learning T independent, static MMSBs and fails to take into account 
any of our knowledge of the underlying node dynamics. An extension considered in Fu et al. (2009) 
is to say that the prior distribution for the $\pi^t$ should evolve over time. However, each mixed membership
vector is still sampled from the same distribution at each time, so the effect is to model only
aggregate dynamics. 

In contrast, and as discussed in the previous section, we would prefer that the mixed membership 
vector of nodes evolved individually but under mutual influence. The particular form of influence 
we will study is 

$$\mu_p^{t+1} = (1 - \beta_p)\,\mu_p^t + \beta_p\,\mu_{\mathrm{avg}}^t + \text{noise term}, \tag{24.3}$$

where $\mu_{\mathrm{avg}}^t = \frac{1}{|S_p^t|} \sum_{q \in S_p^t} w_{p \leftarrow q}\, \mu_q^t$ is the weighted average of node $p$'s neighbors’ log-membership
vectors. Thus, the membership vector of node $p$ at time $t+1$ is a weighted average of his membership
vector at time $t$ as well as the membership vectors of the nodes he has interacted with at time $t$. This
feature of our model has the desired effect of incorporating feedback between network structure and
individual node dynamics. The relative importance of the neighbors is captured by the parameter
$0 \le \beta_p \le 1$; larger $\beta_p$ means that node $p$ is more susceptible to influence from his neighbors.
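As a small worked illustration of Equation (24.3), consider $K = 2$ roles and ignore the noise term. The numbers below are chosen purely for illustration and do not come from the chapter's experiments.

```latex
% Assumed values: \mu_p^t = (2, 0), \mu_{\mathrm{avg}}^t = (0, 2), \beta_p = 0.25
\begin{aligned}
\mu_p^{t+1} &= 0.75\,(2, 0) + 0.25\,(0, 2) = (1.5,\ 0.5), \\
\pi_p^{t}   &= \left(\frac{e^{2}}{e^{2} + e^{0}},\ \frac{e^{0}}{e^{2} + e^{0}}\right) \approx (0.881,\ 0.119), \\
\pi_p^{t+1} &= \left(\frac{e^{1.5}}{e^{1.5} + e^{0.5}},\ \frac{e^{0.5}}{e^{1.5} + e^{0.5}}\right) \approx (0.731,\ 0.269).
\end{aligned}
```

The node moves part of the way toward its neighbors' average, and a larger $\beta_p$ would move it further in a single step.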

Before proceeding further, we note that exact inference is not feasible even for static MMSBs, so 
adding dynamics to a model makes the inference problem much harder. Here we use a variational 
EM approach that allows us to do efficient approximate inference (Beal and Ghahramani, 2003; 
Xing et al., 2003). 

24.1.3 Related Work 

The problem of properly characterizing selection and influence has been a subject of extensive stud- 
ies in sociology. For instance, Steglich et al. (2010) suggested a continuous time agent-based model 
of network co-evolution. In this model, each agent is characterized by a certain utility function that 
depends on the agent’s individual attributes as well as his/her local neighborhood in the network. 
The agents evolve as continuous-time Markovian processes which, at randomly chosen time points, 
select an action to maximize their utility. Despite its intuitive appeal, a serious shortcoming of this 
model is that it cannot handle missing data well, thus most of the attributes have to be fully ob- 
servable. This was addressed in Fan and Shelton (2009), where a continuous dynamic Bayesian 
approach was developed. Continuous-time models have certain advantages when the network ob- 
servations are infrequent and well-separated in time. In situations where more fine-grained data is 
available, however, discrete-time models are more suitable (Hanneke et al., 2010). 

The model presented here is based on MMSBs (Airoldi et al., 2008). MMSBs are an extension
of stochastic blockmodels that have been studied extensively both in social sciences and in computer 
science (Holland et al., 1983; Goldenberg et al., 2010). In a stochastic blockmodel each node is as- 
signed to a block (or a role), and the pattern of interactions between different nodes depends only 
on their block assignment. Many situations, however, are better described by multi-faceted interac- 
tions, where nodes can bear multiple latent roles that influence their relationships to others. MMSBs 
account for such “mixed” interactions by allowing each node to have a probability distribution over 
roles and by making the interactions role-dependent (Airoldi et al., 2008). A different approach 
to mixed membership community detection has been developed in physics (Ball et al., 2011; Ahn 
et al., 2010). In particular, Ahn et al. (2010) suggested a definition of communities in terms of links 
rather than nodes. 

Previously, several works have considered a dynamic extension of the MMSB which we will 
henceforth refer to as dMMSB (Fu et al., 2009; Ho et al., 2011; Xing et al., 2010). In contrast to 
dMMSB, where the dynamics were imposed externally, our model assumes that the membership
evolution is driven by the interactions between the nodes through a parametrized influence mechanism.
At the same time, the patterns of those interactions themselves change due to the evolution of
the node memberships. An advantage of the present model over dMMSB is that the latter models only the
aggregate dynamics, e.g., the mean of the logistic normal distribution from which the membership 
vectors are sampled. CMMSB, however, models each node’s trajectory separately, thus providing 
better flexibility for describing system dynamics. Of course, more flexibility comes at a higher com- 
putational cost, as CMMSBs track the trajectories of all nodes individually. This additional cost, 
however, can be well justified in scenarios when the system as a whole is almost static (e.g., no shift 
in the mean membership vector), but different subsystems experience dynamic changes. One such 
scenario that deals with political polarization in the U.S. Senate is presented in our experimental 
results section. 


24.2 Co-evolving Mixed Membership Blockmodel 

Consider a set of $N$ nodes, each of which can have $K$ different roles, and let $\pi_p^t$ be the mixed
membership vector of node $p$ at time $t$. Let $Y_t$ be the network formed by those nodes at time $t$:
$Y_t(p, q) = 1$ if the nodes $p$ and $q$ are connected at time $t$, and $Y_t(p, q) = 0$ otherwise. Further, let
$Y_{0:T} = \{Y_0, Y_1, \ldots, Y_T\}$ be a time sequence of such networks. The generative process that induces
this sequence is described below.

• For each node $p$ at time $t = 0$, employ a logistic normal distribution¹ to sample an initial
membership vector,

$$\pi_{p,k}^0 = \exp\big(\mu_{p,k}^0 - C(\mu_p^0)\big), \qquad \mu_p^0 \sim \mathcal{N}(\alpha^0, A),$$

where $C(\mu) = \log\big(\sum_k \exp(\mu_k)\big)$ is a normalization constant, and $\alpha^0$, $A$ are the prior mean and
covariance matrix.

• For each node $p$ at time $t > 0$, the mean of each normal distribution is updated due to influence
from the neighbors at its previous step:

$$\alpha_p^t = (1 - \beta_p)\,\mu_p^{t-1} + \beta_p\,\mu_{S_p^{t-1}},$$

where $\mu_{S_p^t}$ is the weighted average of the vectors $\mu$ of the nodes to which node $p$ is
connected at time $t$:

$$\mu_{S_p^t} = \frac{1}{|S_p^t|} \sum_{q \in S_p^t} w_{p \leftarrow q}\, \mu_q^t.$$

$\beta_p$ describes how easily the node $p$ is influenced by its neighbors, while the weights $w_{p \leftarrow q}$ allow
for different degrees of influence from different neighbors. The membership vector at time $t$ is

$$\pi_{p,k}^t = \exp\big(\mu_{p,k}^t - C(\mu_p^t)\big), \qquad \mu_p^t \sim \mathcal{N}(\alpha_p^t, \Sigma_\mu),$$

where the covariance $\Sigma_\mu$ accounts for noise in the evolution process.

• For each pair of nodes $p, q$ at time $t$, sample role indicator vectors from multinomial distributions:

$$z_{p \to q}^t \sim \text{Multinomial}(\pi_p^t), \qquad z_{p \leftarrow q}^t \sim \text{Multinomial}(\pi_q^t).$$

Here $z_{p \to q}$ is a unit indicator vector of dimension $K$, so that $z_{p \to q, k} = 1$ means node $p$ undertakes
role $k$ while interacting with $q$.

¹ We found that the logistic normal form of the membership vector suggested in Fu et al. (2009) leads to more tractable
equations compared to the Dirichlet distribution used for static MMSBs.





• Sample a link between $p$ and $q$ as a Bernoulli trial:

$$Y_t(p, q) \sim \text{Bernoulli}\big((1 - \rho)\, z_{p \to q}^{t\,\top} B^t z_{p \leftarrow q}^t\big),$$

where $B^t$ is a $K \times K$ role compatibility matrix, so that $B^t_{rs}$ describes the likelihood of interaction
between two nodes in roles $r$ and $s$ at time $t$. When $B^t$ is diagonal, the only possible interactions
are among the nodes in the same role. Here $\rho$ is a parameter that accounts for the sparsity of the
network (Airoldi et al., 2008).
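The four steps of the generative process can be put together in a short simulation. The following is a minimal sketch and not the authors' implementation: the function names are hypothetical, $B$ is held fixed over time, $\Sigma_\mu$ is taken to be isotropic with standard deviation `sigma_mu`, and a node without neighbors simply keeps its own state (none of these choices is specified in the text).

```python
import numpy as np

def softmax_rows(mu):
    """pi_{p,k} = exp(mu_{p,k} - C(mu_p)), computed in a numerically stable way."""
    e = np.exp(mu - mu.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sample_links(pi, B, rho, rng):
    """Draw roles for every ordered pair and sample Y_t(p, q)."""
    N, K = pi.shape
    Y = np.zeros((N, N), dtype=int)
    for p in range(N):
        for q in range(N):
            if p != q:  # assumption: no self-links
                g = rng.choice(K, p=pi[p])
                h = rng.choice(K, p=pi[q])
                Y[p, q] = rng.binomial(1, (1.0 - rho) * B[g, h])
    return Y

def simulate_cmmsb(N, K, T, alpha0, A, B, beta, W, sigma_mu, rho, rng):
    """Return latent vectors mu (T+1, N, K) and networks Y (T+1, N, N)."""
    mu = rng.multivariate_normal(alpha0, A, size=N)  # mu_p^0 ~ N(alpha^0, A)
    mus, Ys = [], []
    for t in range(T + 1):
        if t > 0:
            alpha = np.empty_like(mu)
            for p in range(N):
                nbrs = np.flatnonzero(Ys[-1][p])  # S_p^{t-1}
                if nbrs.size:
                    mu_S = (W[p, nbrs][:, None] * mu[nbrs]).sum(axis=0) / nbrs.size
                else:
                    mu_S = mu[p]  # assumption: isolated nodes keep their state
                alpha[p] = (1.0 - beta[p]) * mu[p] + beta[p] * mu_S
            mu = alpha + sigma_mu * rng.standard_normal((N, K))
        mus.append(mu)
        Ys.append(sample_links(softmax_rows(mu), B, rho, rng))
    return np.array(mus), np.array(Ys)

rng = np.random.default_rng(1)
N, K, T = 6, 2, 3
mus, Ys = simulate_cmmsb(
    N, K, T,
    alpha0=np.zeros(K), A=3.0 * np.eye(K), B=0.05 + 0.9 * np.eye(K),
    beta=np.full(N, 0.2), W=np.ones((N, N)), sigma_mu=0.1, rho=0.0, rng=rng,
)
```

The feedback loop is visible in the code: the links sampled at step $t-1$ determine the neighbor sets used to update $\mu$ at step $t$, and the updated $\mu$ in turn determines the link probabilities at step $t$.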

Thus, the coupling between dynamics of different nodes is introduced by allowing the role vector of
a node to be influenced by the role vectors of its neighbors. To benefit from computational simplicity,
we update $\pi$ by changing its associated $\mu$. This update of $\mu$ is a linear combination of its current
state and the values of its neighbors’ current states. The influence is measured by a node-specific
parameter $\beta_p$ and the weights $w_{p \leftarrow q}$, which need to be estimated from the data. $\beta_p$ describes how easily
node $p$ is influenced by its neighbors: $\beta_p = 0$ means it is not influenced at all, whereas $\beta_p = 1$
means its behavior is solely determined by the neighbors. On the other hand, $w_{p \leftarrow q}$ reflects the
weight of the specific influence that node $q$ exerts on node $p$, so that larger values correspond to
more influence.

24.2.1 Inference 

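Under the CMMSB, the joint probability of the data $Y_{0:T}$ and the latent variables $\{\mu_{1:N}, z_{p \to q} :
p, q \in N, z_{p \leftarrow q} : p, q \in N\}$ can be written in the following factored form. To simplify the notation,
we define $z_{p,q}$ as the pair $(z_{p \to q}, z_{p \leftarrow q})$. Also, denote the sets of latent group indicators $\{z_{p \to q} :
p, q \in N\}$ and $\{z_{p \leftarrow q} : p, q \in N\}$ as $Z_{\to}$ and $Z_{\leftarrow}$.

$$
\begin{aligned}
P\big(Y_{0:T}, \mu_{1:N}^{0:T}, Z_{\to}^{0:T}, Z_{\leftarrow}^{0:T} \mid \alpha, A, B, \beta_p, w_{p \leftarrow q}, \Sigma_\mu\big)
={}& \prod_t \prod_{p,q} P\big(Y_t(p, q) \mid z_{p,q}^t, B^t\big)\, P\big(z_{p,q}^t \mid \mu_p^t, \mu_q^t\big) \\
&\times \prod_p P\big(\mu_p^0 \mid \alpha^0, A\big) \prod_{t \neq 0} P\big(\mu_p^t \mid \mu_p^{t-1}, \mu_{S_p^{t-1}}, Y_{t-1}, \beta_p\big).
\end{aligned}
\tag{24.4}
$$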

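In Equation (24.4), the term describing the dynamics of the membership vector is defined as follows:²

$$
\begin{aligned}
P\big(\mu_p^t \mid \mu_p^{t-1}, \mu_{S_p^{t-1}}, Y_{t-1}, \beta_p\big) &= f_G\big(\mu_p^t - f_b(\mu_p^{t-1}, \mu_{S_p^{t-1}}),\ \Sigma_\mu\big), \\
f_G(x, \Sigma_\mu) &= \frac{1}{(2\pi)^{K/2} |\Sigma_\mu|^{1/2}}\, e^{-\frac{1}{2} x^{\top} \Sigma_\mu^{-1} x}, \\
f_b\big(\mu_p^{t-1}, \mu_{S_p^{t-1}}\big) &= (1 - \beta_p)\,\mu_p^{t-1} + \beta_p\,\mu_{S_p^{t-1}}.
\end{aligned}
\tag{24.5}
$$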


As we already mentioned, performing exact inference with this model is not feasible. Thus, 
one needs to resort to approximate techniques. Here we use a variational EM approach (Beal and 
Ghahramani, 2003; Xing et al., 2003). The main idea behind variational methods is to posit a sim- 
pler distribution q(X) over the latent variables with free (variational) parameters, and then fit those 
parameters so that the distribution is close to the true posterior in KL divergence. 


$$D_{KL}(q \| p) = \int q(X) \log \frac{q(X)}{p(X, Y)}\, dX. \tag{24.6}$$


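Here we introduce the following factorized variational distribution:

$$
q\big(\mu_{1:N}^{0:T}, Z_{\to}^{0:T}, Z_{\leftarrow}^{0:T} \mid \gamma_{1:N}^{0:T}, \Sigma_{1:N}^{0:T}, \phi_{\to}^{0:T}, \phi_{\leftarrow}^{0:T}\big)
= \prod_{p,t} q_1\big(\mu_p^t \mid \gamma_p^t, \Sigma_p^t\big) \prod_{q} \Big( q_2\big(z_{p \to q}^t \mid \phi_{p \to q}^t\big)\, q_2\big(z_{p \leftarrow q}^t \mid \phi_{p \leftarrow q}^t\big) \Big),
\tag{24.7}
$$

where $q_1$ is the normal distribution, $q_2$ is the multinomial distribution, and $\gamma_p^t$, $\Sigma_p^t$, $\phi_{p \to q}^t$, $\phi_{p \leftarrow q}^t$
are the variational parameters. Intuitively, $\phi_{p \to q, g}^t$ is the probability of node $p$ undertaking the role
$g$ in an interaction with node $q$ at time $t$, and $\phi_{p \leftarrow q, h}^t$ is defined similarly.

² For simplicity, we will assume $\Sigma_\mu$ is a diagonal matrix.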

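For this choice of the variational distribution, we rewrite Equation (24.6) as follows:

$$
\begin{aligned}
D_{KL}(q \| p) ={}& E_q\Big[\log \prod_t \prod_p q_1\big(\mu_p^t \mid \gamma_p^t, \Sigma_p^t\big)\Big] + E_q\Big[\log \prod_t \prod_{p,q} q_2\big(z_{p \to q}^t \mid \phi_{p \to q}^t\big)\Big] \\
&+ E_q\Big[\log \prod_t \prod_{p,q} q_2\big(z_{p \leftarrow q}^t \mid \phi_{p \leftarrow q}^t\big)\Big] - E_q\Big[\log \prod_t \prod_{p,q} P\big(Y_t(p, q) \mid z_{p \to q}^t, z_{p \leftarrow q}^t, B^t\big)\Big] \\
&- E_q\Big[\log \prod_t \prod_{p,q} P\big(z_{p \to q}^t \mid \mu_p^t\big)\Big] - E_q\Big[\log \prod_t \prod_{p,q} P\big(z_{p \leftarrow q}^t \mid \mu_q^t\big)\Big] \\
&- E_q\Big[\log \prod_{t \neq 0} \prod_p P\big(\mu_p^t \mid \mu_p^{t-1}, \mu_{S_p^{t-1}}, Y_{t-1}, \beta_p\big)\Big] - E_q\Big[\log \prod_p P\big(\mu_p^0 \mid \alpha^0, A\big)\Big].
\end{aligned}
\tag{24.8}
$$

In the third line of the above equation, we need to compute the expected value of
$\log\big[\sum_k \exp(\mu_k)\big]$ under the variational distribution, which is problematic. To address this, we
introduce $N$ additional variational parameters $\zeta$, and replace the expectation of the log by its upper
bound induced from the first-order Taylor expansion (Blei and Lafferty, 2007):

$$\log \sum_k \exp(\mu_k) \le \log \zeta - 1 + \frac{1}{\zeta} \sum_k \exp(\mu_k). \tag{24.9}$$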
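The bound in Equation (24.9) holds pointwise for any $\zeta > 0$ and is tight when $\zeta = \sum_k \exp(\mu_k)$. A quick numerical check, with an arbitrary test vector chosen for illustration, can be sketched as follows (the function names are hypothetical):

```python
import numpy as np

def log_sum_exp(mu):
    """Numerically stable log(sum_k exp(mu_k))."""
    m = mu.max()
    return m + np.log(np.exp(mu - m).sum())

def taylor_upper_bound(mu, zeta):
    """Right-hand side of Eq. (24.9): log(zeta) - 1 + sum_k exp(mu_k) / zeta."""
    return np.log(zeta) - 1.0 + np.exp(mu).sum() / zeta

mu = np.array([0.5, -1.0, 2.0])   # arbitrary example vector
exact = log_sum_exp(mu)
zeta_star = np.exp(mu).sum()      # value of zeta at which the bound is tight

for zeta in (1.0, 5.0, zeta_star, 20.0):
    gap = taylor_upper_bound(mu, zeta) - exact
    print(f"zeta = {zeta:8.4f}   bound - exact = {gap:.6f}")
```

The gap is nonnegative for every $\zeta$ and vanishes at `zeta_star`, which is why $\zeta$ can be treated as an extra variational parameter to be optimized.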

The variational EM algorithm works by iterating between the E-step of calculating the expectation 
value using the variational distribution, and the M-step of updating the model (hyper)parameters so 
that the data likelihood is locally maximized. The pseudo-code is shown in Algorithm 1, and the 
details of the calculations are discussed below. 

24.2.2 Variational E-step 

In the variational E-step, we minimize the KL divergence over the variational parameters. Taking the
derivative of the KL divergence with respect to each variational parameter and setting it to zero, we
obtain a set of equations that can be solved via iterative or other numerical techniques. For instance,
the variational parameters $(\phi_{p \to q}^t, \phi_{p \leftarrow q}^t)$ corresponding to a pair of nodes $(p, q)$ at time $t$ can be
found via the following iterative scheme:

$$\phi_{p \to q, g}^t \propto \exp\big(\gamma_{p,g}^t\big) \prod_h \Big( B(g, h)^{Y_t(p,q)} \big(1 - B(g, h)\big)^{1 - Y_t(p,q)} \Big)^{\phi_{p \leftarrow q, h}^t}, \tag{24.10}$$

$$\phi_{p \leftarrow q, h}^t \propto \exp\big(\gamma_{q,h}^t\big) \prod_g \Big( B(g, h)^{Y_t(p,q)} \big(1 - B(g, h)\big)^{1 - Y_t(p,q)} \Big)^{\phi_{p \to q, g}^t}. \tag{24.11}$$

In the above equations, $\phi_{p \to q, g}^t$ and $\phi_{p \leftarrow q, h}^t$ are normalized after each update. Note also that
Equations (24.10) and (24.11) are coupled with each other as well as with the parameters $\gamma_{p,g}^t$, $\gamma_{q,h}^t$.
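The coupled updates (24.10) and (24.11) for a single pair $(p, q)$ at one time step can be sketched as a fixed-point iteration in log space. The function names, the uniform initialization, and the stopping rule below are assumptions made for the example; the entries of $B$ are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def softmax(x):
    """Normalize exp(x) to sum to one, in a numerically stable way."""
    e = np.exp(x - x.max())
    return e / e.sum()

def update_phi_pair(gamma_p, gamma_q, B, y, n_iter=50, tol=1e-8):
    """Iterate Eqs. (24.10)-(24.11) for one pair (p, q) with observation y in {0, 1}."""
    K = gamma_p.shape[0]
    # log of B(g,h)^y * (1 - B(g,h))^(1-y) for every role pair (g, h)
    log_lik = y * np.log(B) + (1 - y) * np.log1p(-B)
    phi_out = np.full(K, 1.0 / K)  # phi_{p->q}, assumed uniform start
    phi_in = np.full(K, 1.0 / K)   # phi_{p<-q}, assumed uniform start
    for _ in range(n_iter):
        old = phi_out.copy()
        # Eq. (24.10): gamma_{p,g} + sum_h phi_in[h] * log_lik[g, h]
        phi_out = softmax(gamma_p + log_lik @ phi_in)
        # Eq. (24.11): gamma_{q,h} + sum_g phi_out[g] * log_lik[g, h]
        phi_in = softmax(gamma_q + log_lik.T @ phi_out)
        if np.abs(phi_out - old).max() < tol:
            break
    return phi_out, phi_in

B = np.array([[0.9, 0.1], [0.1, 0.9]])
phi_out, phi_in = update_phi_pair(np.array([2.0, 0.0]), np.zeros(2), B, y=1)
```

In this example node $p$ leans toward role 0 and a link is observed under an assortative $B$, so the update also pulls $q$'s role indicator toward role 0 even though $\gamma_q$ is uninformative.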

Input: data $Y_t(p, q)$, size $N$, $T$, $K$
Initialize all $\{\gamma\}^t$, $\{\sigma\}^t$
Start with an initial guess for the model parameters.
repeat
    repeat
        for $t = 0$ to $T$ do
            repeat
                Initialize $\phi_{p \to q}^t$, $\phi_{p \leftarrow q}^t$ to $1/K$ for all $g$, $h$
                repeat
                    Update all $\{\phi\}^t$
                until convergence of $\{\phi\}^t$
                Find $\{\gamma\}^t$, $\{\sigma\}^t$
                Update all $\{\zeta\}^t$
            until convergence in time $t$
        end for
    until convergence across all time steps
    Update hyperparameters.
until convergence in hyperparameters

Algorithm 1: Variational EM.

Sets of variational parameters, $\{\gamma\}^t$ and $\{\sigma\}^t$, are initialized at the beginning of variational EM.
We sample $\{\gamma\}^t$ from the normal distribution $\mathcal{N}(\alpha^0, A)$, and initialize $\{\sigma\}^t$ to the same
value for all nodes and all time steps. Once the $\{\phi\}^t$ have converged,
we update $\{\gamma\}^t$ and $\{\sigma\}^t$ using the update equations. Neither of these variational parameters
has a closed-form solution; the details are given in the KL-Distance section of the Appendix.
Here we simply note that their general form is:

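$$\gamma_p^t = f_\gamma\big(\gamma_p^{t-1}, \gamma_p^{t+1}, \gamma_q^t, \phi_{p \to q}^t, \phi_{p \leftarrow q}^t, \zeta_p^t, \Sigma_p^t\big). \tag{24.12}$$

Thus, the parameter $\gamma_p^t$ depends on its immediate past and future values, $\gamma_p^{t-1}$ and $\gamma_p^{t+1}$, as well as
the parameters of its neighbors.

For the variational parameters of a covariance matrix $\Sigma_p^t$, which is assumed to be a diagonal
matrix with components $\big((\sigma_{p,1}^t)^2, (\sigma_{p,2}^t)^2, \ldots, (\sigma_{p,K}^t)^2\big)$, the general form of the optimal point is:

$$\sigma_{p,k}^t = f_\sigma\big(\gamma_{p,k}^t, \zeta_p^t\big). \tag{24.13}$$

Finally, for the variational parameters $\zeta$ we have

$$\zeta_p^t = \sum_k \exp\Big(\gamma_{p,k}^t + \frac{(\sigma_{p,k}^t)^2}{2}\Big). \tag{24.14}$$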

Note that the above equations can be solved via a simple iterative update as before. To expedite 
convergence, however, we combine the iterations with the Newton-Raphson method, where we 
solve for individual parameters while keeping the others fixed, and then repeat this process until all 
the parameters have converged. 
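The closed form for $\zeta$ in Equation (24.14) follows from the bound in Equation (24.9) in two short steps. The derivation below is a sketch that assumes, as in the text, that under the variational distribution $\mu_{p,k}^t$ is normal with mean $\gamma_{p,k}^t$ and variance $(\sigma_{p,k}^t)^2$; node and time indices are dropped for readability.

```latex
% Step 1: take the expectation of the bound (24.9) under q, using the
% moment-generating function of a normal variable.
E_q\Big[\log \sum_k \exp(\mu_k)\Big]
  \le \log \zeta - 1 + \frac{1}{\zeta} \sum_k E_q\big[\exp(\mu_k)\big]
  = \log \zeta - 1 + \frac{1}{\zeta} \sum_k \exp\Big(\gamma_k + \frac{\sigma_k^2}{2}\Big).

% Step 2: minimize the right-hand side over \zeta.
\frac{\partial}{\partial \zeta}\Big[\log \zeta - 1 + \frac{1}{\zeta} \sum_k \exp\Big(\gamma_k + \frac{\sigma_k^2}{2}\Big)\Big]
  = \frac{1}{\zeta} - \frac{1}{\zeta^2} \sum_k \exp\Big(\gamma_k + \frac{\sigma_k^2}{2}\Big) = 0
  \quad\Longrightarrow\quad
  \zeta = \sum_k \exp\Big(\gamma_k + \frac{\sigma_k^2}{2}\Big).
```

At this value of $\zeta$ the bound is as tight as the first-order expansion allows, so the update can be applied directly after each change of $\gamma$ and $\sigma$.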

24.2.3 Variational M-step 

The M-step in the EM algorithm computes the parameters by maximizing the expected log-likelihood
found in the E-step. The model parameters in our case are: $B$, the role compatibility
matrix; the covariance matrix $\Sigma_\mu$; $\beta_p$ for each node; $w_{p \leftarrow q}$ for each pair; and $\alpha$ and $A$ from the prior.

If we assume that the time variation of the block compatibility matrix is small compared to the 




Mixed Membership Blockmodels for Dynamic Networks with Feedback 




evolution of the node attributes, we can neglect the time dependence in B and use its average across 
time, which yields: 


$$\hat{B}(g, h) = \frac{\sum_{p,q,t} Y_t(p, q)\, \phi_{p \to q, g}^t\, \phi_{p \leftarrow q, h}^t}{\sum_{p,q,t} \phi_{p \to q, g}^t\, \phi_{p \leftarrow q, h}^t}. \tag{24.15}$$
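Equation (24.15) is a ratio of two sums over pairs and time steps, which maps naturally onto `numpy.einsum`. The sketch below assumes the variational parameters are stored as dense arrays of shape `(T, N, N, K)`; this layout, the exclusion of self-pairs, and the small constant `eps` guarding against empty role pairs are choices made for the example.

```python
import numpy as np

def update_B(Y, phi_out, phi_in, eps=1e-12):
    """Time-averaged role compatibility estimate, Eq. (24.15).

    Y       : (T, N, N) observed networks
    phi_out : (T, N, N, K), phi_out[t, p, q, g] = phi^t_{p->q, g}
    phi_in  : (T, N, N, K), phi_in[t, p, q, h]  = phi^t_{p<-q, h}
    """
    N = Y.shape[1]
    mask = 1.0 - np.eye(N)  # assumption: self-pairs are excluded
    num = np.einsum('pq,tpq,tpqg,tpqh->gh', mask, Y, phi_out, phi_in)
    den = np.einsum('pq,tpqg,tpqh->gh', mask, phi_out, phi_in)
    return num / np.maximum(den, eps)

# Toy check: every interaction uses role 0, and half of the ordered pairs are linked.
T, N, K = 1, 3, 2
Y = np.zeros((T, N, N))
Y[0, 0, 1] = Y[0, 1, 2] = Y[0, 2, 0] = 1
phi_out = np.zeros((T, N, N, K))
phi_out[..., 0] = 1.0
phi_in = phi_out.copy()
B_hat = update_B(Y, phi_out, phi_in)
```

In the toy check the estimate for the role pair (0, 0) equals the fraction of linked ordered pairs, which is what the formula reduces to when all role assignments are certain.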


Likewise, for the update of the diagonal components of the noise covariance matrix $\Sigma_\mu$,


(%) 2 = N p_ - (! - (*Kk - ( 24 - 16 > 


Similar equations are obtained for $\beta_p$ and $w_{p \leftarrow q}$. The update equations for $\beta_p$ and $w_{p \leftarrow q}$ are functions
of $\gamma$ and $\sigma$, which are related to the transition for the specific node $p$.


^E>oEfc(7p,fc 'Yp,k r Yp,k ^P,k 75* 1 i fc 7p,fc75* Lfc) 

Et>oEfe(7p, fc + a k — 2 7 p ,fc 75*-\fc) + Et>o Efc(75* fc + fc) 


where $\gamma_{S_p^t}$ and $\Sigma_{S_p^t}$ are the mean and covariance of the set of nodes to which node $p$ is connected at
time $t$.

The priors of the model can be expressed in closed form as below: 



p 


(24.17) 


ak 




) 2 -2 «£t p° fc + K) 2 ). 


(24.18) 


24.3 Results 

24.3.1 Experiments on Synthetic Data 

We tested our model by generating a sequence of networks according to the process described 
above, for 50 nodes and $K = 3$ latent roles across $T = 8$ time steps. We used a covariance matrix
of $A = 3I$ and a mean $\alpha^0$ with homogeneous values for the prior, so that initially nodes had a
well-defined role (i.e., the membership vector would have peaked around a single role). More precisely,
the majority of nodes had around 90% of membership probability mass centered at a specific role, 
and on average a third of those nodes had 90% on role k. For the role compatibility matrix, we gave 
high weight at the diagonal. 

Starting from some initial parameter estimates, we performed variational EM and obtained re- 
estimated parameters which were very close to the original values (ground truth). With those learned 
parameters, we inferred the hidden trajectory of agents as given by their mixed membership vector 
for each time step. The results are shown in Figure 24.1, where, for three nodes, we plot the projec- 
tion of trajectories onto the simplex. One can see that for all three nodes, the inferred trajectories 
are very close to the actual ones.

FIGURE 24.1

Actual and inferred mixed membership trajectories on a simplex.

24.3.2 Comparison with dMMSB 

As a further verification of our results, we compare the performance of our inference method to the 
dynamic mixed membership stochastic blockmodel (dMMSB) (Fu et al., 2009). We use synthetic 
data generated in a manner similar to the previous section. This time, though, for simplicity we keep 
$K = 2$ and set all the $\beta$s to some constant for all the nodes: $\beta = 0.1$ in one trial and $\beta = 0.2$ in
the other. In this case, we compare performance by evaluating the distance in $L_2$ norm between
actual and inferred mixed membership vectors for each method. At each time step, we calculate the
average over all nodes of the $L_2$ distance from the actual membership vector.
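The error measure described above can be written compactly. This is a minimal sketch with a hypothetical function name; it assumes the true and inferred membership vectors are stored as arrays of shape `(T, N, K)` and that the inferred roles have already been aligned with the true ones, since role labels are only identified up to permutation.

```python
import numpy as np

def mean_l2_error(pi_true, pi_inferred):
    """Per-time-step average over nodes of the L2 distance between
    true and inferred membership vectors.

    pi_true, pi_inferred : arrays of shape (T, N, K)
    returns              : array of shape (T,)
    """
    return np.linalg.norm(pi_true - pi_inferred, axis=2).mean(axis=1)

# Toy example with T = 2 time steps, N = 2 nodes, K = 2 roles.
pi_true = np.array([[[1.0, 0.0], [0.5, 0.5]],
                    [[1.0, 0.0], [0.0, 1.0]]])
pi_hat = np.array([[[1.0, 0.0], [0.5, 0.5]],
                   [[0.0, 1.0], [0.0, 1.0]]])
errors = mean_l2_error(pi_true, pi_hat)
```

In the toy example the first time step is recovered exactly, while at the second time step one of the two nodes is assigned to the wrong role.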

As we show in Figures 24.2(a) and 24.2(b), CMMSBs capture the dynamics better than the 
dMMSBs. This is due to the fact that our model tracks all of the nodes individually (internal dy- 
namics), while dMMSBs regard the dynamism as an evolution of the environment (external dy- 
namics). Here, we have only included results for relatively small and homogeneous dynamics. In 
fact, we noticed that our method tends to fare even better as we increase the degree of dynamics or 
the heterogeneity of dynamics across nodes (node-varying values of $\beta$). We believe heterogeneous
dynamics are more prevalent in real systems, and so we expect our method to outperform dMMSB 
even more than is indicated by Figure 24.2(b). 

24.3.3 U.S. Senate Co-Sponsorship Network

We have also performed some preliminary experiments for testing our model against real-world 
data. In particular, we used senate co-sponsorship networks from the 97th to the 104th Senate, 
by considering each senate as a separate time point in the dynamics. There were 43 senators who 
remained part of the senate during this period. For any pair of senators (p, q) in a given senate, we 
generated a directed link $p \to q$ if $p$ co-sponsored at least three bills that $q$ originally sponsored.
The threshold of three bills was chosen to avoid an overly dense network. With this data, we
wanted to test (a) to what extent senators tend to follow others who share their political views (i.e., 
conservative vs. liberal) and (b) whether some senators change their political creed more easily than 
others. 
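The link construction described above amounts to thresholding a matrix of co-sponsorship counts. The sketch below uses a hypothetical count matrix `counts`, where `counts[p, q]` is the number of bills sponsored by senator `q` that senator `p` co-sponsored in a given Senate; the toy numbers are invented for illustration and are not taken from the data.

```python
import numpy as np

def cosponsorship_links(counts, threshold=3):
    """Directed adjacency: Y[p, q] = 1 if p co-sponsored at least
    `threshold` bills originally sponsored by q."""
    Y = (np.asarray(counts) >= threshold).astype(int)
    np.fill_diagonal(Y, 0)  # a senator does not link to himself/herself
    return Y

# Invented toy counts for three senators.
counts = np.array([[0, 5, 1],
                   [3, 0, 2],
                   [0, 4, 9]])
Y = cosponsorship_links(counts)
```

Raising the threshold makes the network sparser, which is the trade-off mentioned in the text.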


FIGURE 24.2

Inference error for dMMSB and CMMSB for synthetic data generated with $K = 2$ and $\beta = 0.1$ for
all the nodes (a), and with $\beta = 0.2$ for all the nodes (b). Panel titles: (a) Error Comparison with
$\beta = 0.1$ across all nodes; (b) Error Comparison with $\beta = 0.2$ across all nodes.


The number of roles K = 2 was chosen to reflect the mostly bi-polar nature of the U.S. Senate. 
The susceptibility of senator $p$ to influence is measured by the corresponding parameter $\beta_p$, which
is learned using the EM algorithm. High $\beta$ means that a senator tends to change his/her role more
easily. Likewise, the power of influence of senator $q$ on senator $p$ is measured by the parameter
$w_{p \leftarrow q}$, where $w_{p \leftarrow q_1} > w_{p \leftarrow q_2}$ means senator $q_1$ is more influential on senator $p$ than senator
$q_2$. Here the direction of the arrow reflects the direction of the influence, which is opposite to the
direction of the link. To initialize the EM procedure, we assigned the same $\beta$ and $w$ to all the
senators, and started with a matrix $B$ weighted on the diagonal.

Another method for validation is to compare the degree of influence. Our model handles and 
learns the degree of influence in the update equation. Sorting out influential senators is an area of 
active research. KNOWLEGIS has ranked U.S. senators on various criteria,
including influence, since 2005. Since our data was extracted from the 97th Senate to the 104th
Senate, direct comparison of the rankings was impossible. Another study (Maisel, 2010) ranked the 
10 most influential senators in both parties who have been elected since 1955. We compared our top 
five influential senators, and were able to find three senators (Senator Robert Byrd, Senator Strom 
Thurmond, and Senator Bob Dole) on the list. 

24.3.4 Interpreting Results 

The role compatibility matrix learned from the variational EM has high values on the diagonal 
confirming our intuition that interaction is indeed more likely between senators that share the same 
role. Furthermore, the learned values of $\beta$ showed that senators varied in their “susceptibility.” In
particular, Senator Arlen Specter was found to be the most influenceable one, while Sen. Dole was
found to be one of the most inert ones. Note that while there are no direct ways of estimating the
“dynamism” of senators, our results seem to agree with our intuition about both senators (e.g., Sen.
Specter switched parties in 2009 while Sen. Dole became his party’s candidate for President in
1996). 

To get some independent verification, we compared our results to the yearly ratings that the ACU
(American Conservative Union) and ADA (Americans for Democratic Action) assign to senators.³
The ACU and ADA rate every senator based on selected votes which they believe display a clear ideological
distinction, so that a high ACU score indicates a conservative senator and a low ACU score a liberal one,
and vice versa for the ADA. To compare the ratings with our
predictions (given by the membership vector) we scaled the former to get scores in the range [0, 1].

Figure 24.3 shows the relationship between these scores and our mixed membership vector score,
confirming our interpretation of the two roles in our model as corresponding to liberal/conservative.
Although these values cannot be used for quantitative agreement, we found that at
least qualitatively, the inferred trajectories agree reasonably well with the ACU/ADA ratings. This
agreement is rather remarkable since the ACU/ADA scores are based on selected votes rather than
on a co-sponsorship network as in our data.

³ Accessible at http://www.conservative.org/, http://www.adaction.org/.


FIGURE 24.3

Correlation between ACU/ADA scores and inferred probabilities. Panel titles: Correlation between
Inference and ACU score; Correlation between Inference and ADA score.

Of course, we are most interested in correctly identifying the dynamics for each senator. We
compare the inferred trajectories of the most dynamic senator and of an inert senator to the
ACU and ADA scores. In Figure 24.4 the ADA scores have been flipped, so that all of the scores can
be compared on the same scale. However, since ACU/ADA scores are assigned to
every senator each year, the dynamics of inference and the dynamics of ACU/ADA scores cannot
be compared one to one. Not all senators showed as high a correlation in the trend as Sen. Specter and
Sen. Dole.
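The rescaling and flipping of the ratings, and the comparison with inferred probabilities, can be sketched as below. The assumption that raw ratings lie on a 0 to 100 scale, the function names, and the toy numbers are all illustrative and are not taken from the chapter.

```python
import numpy as np

def rescale_rating(rating, flip=False, lo=0.0, hi=100.0):
    """Map ratings from [lo, hi] to [0, 1]; optionally flip so that
    a liberal-oriented rating points in the conservative direction."""
    s = (np.asarray(rating, dtype=float) - lo) / (hi - lo)
    return 1.0 - s if flip else s

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Invented toy ratings for four senators.
acu = rescale_rating([90, 70, 20, 5])
ada = rescale_rating([10, 35, 85, 95], flip=True)
inferred = np.array([0.95, 0.60, 0.25, 0.10])  # hypothetical P(conservative role)
r_acu, r_ada = pearson(acu, inferred), pearson(ada, inferred)
```

After flipping, both ratings increase with conservatism, so a positive correlation with the inferred score in both cases supports reading the two roles as liberal and conservative.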

24.3.5 Polarization Dynamics 

The yearly ACU/ADA scores give a good comparison of the relative political position of senators
scored in each year. However, they are not very appropriate for comparison between years,
because the score is based on voting records for different bills in each year.
Therefore, for validation of the dynamics we turn to another scoring system highly regarded by
political scientists and used to observe historical trends, the DW-NOMINATE score. For the time
period of our study, McCarty et al. (2006) show that the political polarization of the Senate was
increasing. In particular, they show that the gap between the average DW-NOMINATE scores of
Republicans and Democrats is monotonically increasing, as we show in Figure 24.5. In fact, the
polarization for the entire Senate was stronger every year. This is due to the unbalanced seats in the
entire Senate: our data had 22 Republicans and 21 Democrats, while for the entire
Senate, the majority outnumbered the minority by around 10 seats. For comparison, for each time step we
took the average of our inferred score for the 14 most and the 14 least conservative senators. As we show
FIGURE 24.4

Comparison of inference results with ACU and ADA scores: Sen. Specter (left) and Sen. Dole 
(right). 


in Figure 24.5, our inferred result agrees qualitatively with the results of McCarty et al. (2006), 
showing an increase in polarization for every senate in the studied time-window. Since the DW- 
NOMINATE score uses its own metric, and our polarization is measured by the difference between 
upper average and lower average probability, we should not expect to get quantitative agreement. 
We would like to highlight, however, that the direction of the trend is correctly predicted for each of 
the eight terms. 
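The polarization measure described above (difference between the upper and lower average inferred score) can be sketched as follows. This is only an illustration: the function name `polarization_gap`, the argument `n_extreme`, and the toy scores are assumptions, not the authors' code or data.

```python
def polarization_gap(scores, n_extreme=14):
    """Mean of the n_extreme highest scores minus mean of the n_extreme lowest.

    `scores` holds one inferred score in [0, 1] per senator for a single
    time step (hypothetical input format).
    """
    if len(scores) < 2 * n_extreme:
        raise ValueError("need at least 2 * n_extreme scores")
    ordered = sorted(scores)
    lower = ordered[:n_extreme]       # least conservative senators
    upper = ordered[-n_extreme:]      # most conservative senators
    return sum(upper) / n_extreme - sum(lower) / n_extreme


# Toy example with four made-up scores and two senators per extreme.
toy_gap = polarization_gap([0.1, 0.2, 0.8, 0.9], n_extreme=2)
```

Evaluating such a gap at every time step gives a sequence whose direction of change can be compared with the DW-NOMINATE trend, even though the two scales differ.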



FIGURE 24.5

Polarization trends during the 97th-104th U.S. Congresses (horizontal axis: Congress number).


24.4 Discussion 

We have presented the CMMSB for modeling inter-coupled node and link dynamics in networks.
We used a variational EM approach for learning and inference with CMMSB, and were able to
reproduce the hidden dynamics for synthetically generated data, both qualitatively and quantitatively.
We also tested our model using the U.S. Senate bill co-sponsorship data, and obtained reasonable
results in our experiments. In particular, CMMSBs were able to detect increasing polarization
in the Senate, as reported by other sources that analyze individual voting records of the senators.

Our results with the U.S. Senate dataset suggest that our dynamical model can actually capture 
some nuances of individual dynamics. While we lack a ground truth for the true position of senators, 
third party analyses qualitatively support the findings of our model. Of course, many factors are not 
explicitly modeled in our approach, but we hope that by including individual dynamical terms we 
capture these effects implicitly. For instance, external events like upcoming re-election campaigns
surely affect senators' actions. While the true chain of events may depend on such factors, if not all relevant
external events are, or can be, included in our model, then capturing dynamics through shifts
in observed relationships is a good proxy.

The approach to modeling influence described in Section 24.2 is only one of several possibilities. 
Although we learned a static parameter $\beta$ for each node, describing how easily that node is influenced,
we also pointed out the possibility of adding a weight that varies for each pair: that is, a node may 
be more influenced by one person than another. Additionally, someone’s influence may change over 
time. Finally, we chose a simple linear influence mechanism. In principle, someone may be more 
influential along one axis than another. For instance, a node may be influenced by a friend’s musical 
taste, but not by his politics. 

As future work, we intend to test our model against different real-world data, such as communication
networks or co-authorship networks of publications. We also plan to extend CMMSBs in
several ways. A significant bottleneck of the current model is that it explicitly considers links between
all the pairs of nodes, resulting in a quadratic complexity in the network size. Most real-world
networks, however, are sparse, which is not accounted for in the current approach. Introducing sparsity
into the model would greatly enhance its efficiency. We note that this is also a drawback for
static MMSBs, but progress has already been made towards reducing this complexity (Mørup et al.,
2011).

An additional drawback of MMSBs (and stochastic blockmodels in general) is the inability
to properly deal with degree heterogeneity. Indeed, MMSBs (or related latent space models) might
assign nodes to the same group based merely on the frequency of their interactions with the other
nodes. Possible remedies are found in the degree-corrected blockmodel recently proposed in Karrer
and Newman (2011) or in exponential random graph models that separately model node and group
variability (Reichardt et al., 2011). The problem reveals a fundamental ambiguity about network
modeling. A priori, we have no reason to believe that node connectivity is a less important dimen- 
sion for clustering nodes than homophily for some hidden attribute. Our intuition leads us to expect 
otherwise for human networks, but this intuition must be explicitly modeled. In the co-sponsorship 
network studied here, most senators are well-connected and so the network structure is better ex- 
plained by political views than node connectivity. However, large variability in node connectivity 
has been observed in many social networks where this effect will have to be explicitly modeled. 
Appendix

Alternative View of EM Algorithm 

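We start with the log-likelihood function, where $Y$ is the data, $X$ is the set of latent variables, and $\Theta$
is the set of model parameters:

$$
\begin{aligned}
\log p(Y \mid \Theta) &= \log \int p(Y, X \mid \Theta)\, dX \\
&= \log \int q(X)\, \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \\
&\geq \int q(X) \log \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \quad \text{(Jensen's inequality)}.
\end{aligned}
\tag{24.19}
$$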


We define the lower bound as the free energy, which can be written as

$$
\log p(Y \mid \Theta) - D_{KL}\big(q(X) \,\|\, p(Y, X)\big). \tag{24.20}
$$


The goal is to maximize the lower bound (free energy) by updating $q$ and $\Theta$. In the E-step, we minimize
the KL distance between the two distributions, and in the M-step, we maximize the free energy with $q$
fixed at the distribution obtained in the E-step.
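A small numerical check can make the bound concrete. The sketch below assumes a discrete latent variable with three states and made-up joint probabilities; it is not the CMMSB model. It illustrates that the free energy never exceeds the log-likelihood, that the gap is the KL divergence from $q$ to the posterior over $X$, and that the bound is tight when $q$ equals that posterior.

```python
import math


def free_energy(joint, q):
    """Lower bound sum_x q(x) * log(p(Y, x) / q(x)) for a discrete latent X."""
    return sum(qx * math.log(px / qx) for px, qx in zip(joint, q) if qx > 0)


def kl_divergence(q, p):
    """KL divergence D_KL(q || p) between two discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


# Hypothetical values of p(Y, X = x | Theta) for the observed Y.
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))               # log p(Y | Theta)
posterior = [px / sum(joint) for px in joint]     # p(X | Y, Theta)

q_uniform = [1 / 3, 1 / 3, 1 / 3]
bound_uniform = free_energy(joint, q_uniform)     # strictly below log_evidence
bound_posterior = free_energy(joint, posterior)   # equals log_evidence
gap_uniform = log_evidence - bound_uniform        # equals KL(q_uniform || posterior)
```

In this toy setting the E-step corresponds to moving `q` toward `posterior`, which closes the gap for fixed parameters.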

KL-Distance 


Here we present the KL distance between $q(X)$ and $p(Y, X)$:

D KL (q\\p) = 

E E (~l E M - - 7^] - log(27r) fc / 2 - log(|Ep| 1//2 ) 


t p 


+ E E E ^q.s lo s + E E E lo s €^ q ,h 
T V -1


2 

-EEE ^p^q,h i^q^h -iogc* + i-^E exp(7* fc H — ^—)) 


■(dl-Mdl - (fc/2) log(27r) - log(|Ep| 1/2 )) 

- ^2(-\ E q[(Fp ~ a°) T A- 1 (/r° - a 0 )] - log(27r) fc/2 - log(|A| 1/2 )) , 


(24.21) 
where the term $\exp(\gamma_{p,k} + \sigma_{p,k}^2/2)$ comes from the moment-generating function of the normal distribution,
$M_X(t) := E[e^{tX}]$, with $t = 1$. The first line simplifies to $\text{const} - \sum_k \log \sigma_{p,k}$, where, once again,
we have taken the covariance matrix to be diagonal.

Variational E-step 


In the variational E-step, we minimize the KL distance over the variational parameters. The variational
parameters $\{\gamma_p^t\}$ and $\{\sigma_{p,k}^t\}$ have to be solved for numerically: we use the Newton-Raphson method
as an optimization algorithm for tightening the bound with respect to those variational parameters.

First, we minimize the divergence with respect to $\gamma_p^t$. Since the other variational parameter, $\Sigma_p^t$,
is assumed to be a diagonal matrix, we treat the multivariate normal distribution as a product
of independent normal distributions and update the mean and variance for each coordinate. We use
the Newton-Raphson method for each coordinate, where the derivative is:


dD KL {q\\p)/djl k = 


«fc ) 2 


El + EE Pp^q,g ex P(7 P ,fc + 2 ) 


q 9 


El ^qi-Pjk + EE ^qf—p,h ex P(7 p, k + ^2 ) 

q q h 
(24.22)


7^ t = (w p< _ q )( 7* ife ) is the mean of set of neighbors of node p at time t, and cr|, k = 

J2 qGS t {utp^g) 2 fc )“ are the variance of set of neighbors of node p at time t. Mean and variance 
of neighbors can be easily computed since the components of neighbors are independent of each 
other and are Gaussian themselves. The derivative above is valid for $\gamma^t_{p,k}$ when $0 < t < T$; the form
is slightly different when $t = 0$ or $t = T$.

Second, we minimize the divergence with respect to $((\sigma^t_{p,1})^2, (\sigma^t_{p,2})^2, \ldots, (\sigma^t_{p,K})^2)$ using the
Newton-Raphson method. The derivative with respect to $\sigma^t_{p,k}$ is:
where $\eta_k$ is the diagonal component of the covariance matrix $\Sigma_\mu$. When $t = 0$ or $t = T$, the
derivative differs slightly from the above equation.

Variational M-step 


$$
\log p(Y \mid \Theta) \geq E_q\!\left[ \log \frac{p(Y, X \mid \Theta)}{q(X)} \right] \tag{24.24}
$$


The M-step in the EM algorithm computes the hyperparameters by maximizing the lower bound
with $q$ fixed at the distribution found in the E-step. The lower bound of the log-likelihood follows from Jensen's inequality
(Equation 24.24), and the expectation is taken with respect to the variational distribution. Hence the
general form of the update equation at the $k$th step is as below:

$$
\Theta = \arg\max_{\Theta} \int q^k(X) \log p(Y, X \mid \Theta)\, dX. \tag{24.25}
$$

Since the final form of most model parameters is quite intuitive, we only derive Equation (24.17)
in this section. To obtain the update equation of ri p , we start from differentiating the expected log- 
likelihood and setting it to zero: 
Solving the equation above.
To solve for the optimal weight, we differentiate the lower bound with respect to $w_{p \leftarrow q}$ and set the
derivative to zero:


(24.26) 
Networks allow the representation of interactions between objects. Their structures are often complex
to explore and need some algorithmic and statistical tools for summarizing. One possible way
to go about this is to cluster their vertices into groups having similar connectivity patterns. 

This chapter aims to present an overview of clustering methods for network vertices. Com- 
mon community structure searching algorithms are detailed. The well-known stochastic blockmodel 
(SBM) is then introduced and its generalization to overlapping mixed membership structure closes 
the chapter. Examples of application are also presented and the main hypothesis underlying the 
presented algorithms is discussed. 
25.1 Introduction

Because networks are a straightforward formalism for representing interactions between objects of
interest, they are used in many scientific fields. In biology, regulatory networks allow us to describe
the regulation of gene expression through transcriptional factors (Milo et al., 2002), while metabolic
networks focus on representing pathways of biochemical reactions (Lacroix et al., 2006). In addition,
the binding procedures of proteins are often described as protein-protein interaction networks (Albert
and Barabasi, 2002; Barabasi and Oltvai, 2004). In social sciences, networks are widely used to
represent relational ties between actors (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001;
Palla et al., 2007). Other examples of networks are power grids (Watts and Strogatz, 1998) and the
World Wide Web (Zanghi et al., 2008).

As a network describes the presence or absence of links between objects, the notion of groups of 
nodes having a similar behavior naturally arises. In some cases, this notion of similarity is even the 
process from which the network originates. Common affinities will, for example, lead to edges in 
social networks, whereas gene duplication is the main growth process of protein-protein interaction 
networks. 

The most widely assumed group structure is the partition, where each node belongs to only one
group. When dealing with real-world applications, this assumption of empty intersections between
groups is often too rigid. For instance, so-called moonlighting proteins are known to have several
functions in the cells (Jeffery, 1999). In social networks, actors typically belong to several
groups of interests (Palla et al., 2005). Thus, exploring structures which allow for more complex
membership for each node is of great practical interest. One possibility consists in considering
models where each individual is allowed to belong to all groups depending on a mixed membership
coefficient, all membership coefficients summing to 1. This approach is considered, for example, in
the latent Dirichlet allocation model (Blei et al., 2003), in the context of text mining, or in the mixed
membership stochastic blockmodel (Airoldi et al., 2008) for the clustering of nodes in networks. An
alternative is to consider that each node belongs to multiple groups, but that for each possible group
the node either belongs to the group or it does not.

In this chapter, we propose to give an overview of the methods using the latter approach, i.e.,
retrieving group memberships of nodes based on their connectivity pattern, the memberships of
each node being summarized in a $\{0, 1\}$-vector. The first section introduces the notion of networks
and the characteristics of real networks one should have in mind when building models. The second 
section deals with the partitioning of nodes, i.e., methods assigning each vertex to exactly one group. 
The last section presents generalizations of those methods which allow for the overlapping groups 
of nodes. 


25.2 Networks and Their Characteristics 
25.2.1 Network Representations 

A network is commonly represented by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of $N$ vertices and
$\mathcal{E}$ is a set of edges between pairs of vertices. The graph is said to be directed (Figure 25.1) if the
pairs $(u, v)$ in $\mathcal{E}$ are ordered. Conversely, unordered pairs form an undirected graph (Figures 25.2
and 25.3). Note that the edges can be weighted by a function $w : \mathcal{E} \to F$ for any set $F$. However,
we will concentrate only on binary graphs, i.e., $F = \{0, 1\}$. The size of $\mathcal{G}$ is then given through the
edge count $m = |\mathcal{E}|$. The graph is said to be dense if $m$ is close to the maximal number $M$ of edges,
whereas a low value of $m$ leads to a sparse graph. To characterize the density of $\mathcal{G}$, a criterion $\delta(\mathcal{G})$
is often used. It is defined as the ratio of the number $m$ of existing edges over the number $M$ of
potential edges:

$$
\delta(\mathcal{G}) = \frac{m}{M}.
$$

For a directed graph, $M = N^2$ while $M = N(N + 1)/2$ otherwise. If $\mathcal{G}$ does not contain any
self loop, i.e., an edge from a vertex to itself, then $M = N(N - 1)$ for a directed graph and
$M = N(N - 1)/2$ otherwise.
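The four cases for the maximal number of edges can be written down directly. The following is a minimal sketch; the function names `max_edges` and `density` are illustrative choices, not notation from the chapter.

```python
def max_edges(n, directed, self_loops):
    """Maximal number M of edges on n vertices for the four cases in the text."""
    if directed:
        return n * n if self_loops else n * (n - 1)
    return n * (n + 1) // 2 if self_loops else n * (n - 1) // 2


def density(n, m, directed=False, self_loops=False):
    """Density delta(G) = m / M of a graph with n vertices and m edges."""
    return m / max_edges(n, directed, self_loops)


# Undirected graph on 4 vertices without self loops and with 3 edges.
example_density = density(4, 3)
```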

The neighborhood $N_{\mathcal{G}}(u)$ of vertex $u$ is defined as the set of all the vertices connected to $u$. Its
degree $d(u)$ is equal to its number of incident edges. Finally, a path from a vertex $u$ to a vertex $v$ is
a sequence of edges in $\mathcal{E}$ starting at vertex $v_0 = u$ and ending at vertex $v_{k+1} = v$:

$$
(v_0, v_1), (v_1, v_2), \ldots, (v_k, v_{k+1}).
$$

If there exists at least one path between every pair of vertices, then the graph is said to be connected. 
For instance, the graph in Figure 25.1 is connected contrary to the graphs in Figures 25.2 and 25.3, 
which have some isolated vertices. 

A network can equivalently be represented by a so-called adjacency matrix $X$, which describes
the presence or absence of an edge in a graph. As mentioned already, we focus on binary graphs
and therefore $X$ is in $\{0, 1\}^{N \times N}$. Thus, if there exists an edge from vertex $i$ to vertex $j$, then $X_{ij}$
equals 1, and 0 otherwise. If the network is undirected, the matrix is symmetric, i.e., $X_{ij}$ and $X_{ji}$
are equal. Non-zero entries of the diagonal correspond to self-loops. Every property of a graph can
be interpreted in terms of its adjacency matrix: the degree of a vertex is the sum of the row or the
column corresponding to it, and the fact that two vertices $(i, j)$ are in different connected components
is equivalent to $(X^k)_{ij} = 0$ for all powers $1 \leq k \leq N$.
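The two adjacency-matrix properties mentioned above can be checked on a toy graph. This is a plain-Python sketch for an undirected graph stored as a list of lists; the helper names are illustrative, and in practice a linear-algebra library or a graph search would be preferable to repeated matrix products.

```python
def degrees(X):
    """Row sums of a binary adjacency matrix (the degrees for an undirected graph)."""
    return [sum(row) for row in X]


def matmul(A, B):
    """Product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def same_component(X, i, j):
    """True if (X^k)_ij is non-zero for some power 1 <= k <= N."""
    n = len(X)
    power = [row[:] for row in X]          # X^1
    for _ in range(n):
        if power[i][j] != 0:
            return True
        power = matmul(power, X)           # next power of X
    return False


# Path 0 - 1 - 2 plus an isolated vertex 3.
X = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
```

Here vertices 0 and 2 are joined by a path of length 2, so $(X^2)_{02}$ is non-zero, while every power has a zero entry between vertex 3 and the others.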

25.2.2 Properties of Real Networks 

Most real networks have been shown to share some properties (Albert et al., 1999; Broder et al.,
2000; Dorogovtsev et al., 2000; Amaral et al., 2000; Strogatz, 2001) that we briefly recall in the
following:

• Sparsity: The number of edges is linear in the number of vertices. In other terms, the mean 
degree remains bounded when N grows, implying that the density tends to 0. 

• Existence of a giant component: Real networks are often disconnected. However, a majority 
of the vertices are contained in the same component, the other components being significantly 
smaller. 

• Degree heterogeneity: A few vertices have a lot of connections while most of the vertices 
have very few links. The degrees of the vertices are sometimes characterized using a scale-free 
distribution (e.g., see Barabasi and Albert, 1999). 

• Small world: The shortest path from one vertex to another is generally rather small, typically
of size $O(\log N)$.

All the properties listed above can be verified through easily computable statistics, namely the
degrees and the paths of length at most $N$. As they are key properties in the interpretation of real
network behaviors with respect to information diffusion (Pastor-Satorras and Vespignani, 2001) or
attack tolerance (Albert et al., 2000), we would like our random graph models to produce networks
with similar properties.
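As a rough illustration of such statistics, the sketch below computes the mean degree, the maximal degree, and the fraction of vertices in the largest connected component of an undirected graph. The adjacency-dictionary input format and the function names are assumptions made for this example.

```python
from collections import deque


def component_sizes(adj):
    """Sizes of the connected components, largest first (breadth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def summary(adj):
    """Simple statistics related to sparsity, heterogeneity, and the giant component."""
    n = len(adj)
    degs = [len(neighbors) for neighbors in adj.values()]
    return {
        "mean_degree": sum(degs) / n,
        "max_degree": max(degs),
        "giant_fraction": component_sizes(adj)[0] / n,
    }


# Toy undirected graph: a star on vertices 0, 1, 2 and an isolated vertex 3.
toy = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
toy_summary = summary(toy)
```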

Most of the real networks exhibit another property, which is the one of interest in this chapter, 
namely an underlying group structure. This means that nodes can be spread into classes having 
similar connectivity patterns. In order to retrieve such structures, statistical and algorithmic tools 
have been developed. 
FIGURE 25.1

Subset of the yeast transcriptional regulatory network (Milo et al., 2002). Nodes of the directed 
network correspond to genes, and two genes are linked if one gene encodes a transcriptional factor 
that directly regulates the other gene. 



FIGURE 25.2 

The metabolic network of bacteria Escherichia coli (Lacroix et al., 2006). Nodes of the undirected 
network correspond to biochemical reactions, and two reactions are connected if a compound pro- 
duced by the first one is a part of the second one (or vice-versa). 
FIGURE 25.3

Subset of the French political blogosphere network. The data consists of a single day snapshot 
of political blogs automatically extracted on October 14th, 2006 and manually classified by the 
“Observatoire Presidentielle project” (Zanghi et al., 2008). Nodes correspond to hostnames and 
there is an edge between two nodes if there is a known hyperlink from one hostname to another (or 
vice-versa). 


25.3 Graph Clustering 

In this section, we concentrate on the classification of vertices depending on their connection pro- 
files. There has been a wealth of literature on the topic which goes back to the earlier work of 
Moreno (1934). As shown in Newman and Leicht (2007), it appears that available methods can 
be grouped into three significant categories. First, some models look for community structure, also 
called assortative mixing (Newman, 2003; Danon et al., 2005), where vertices are partitioned into 
classes such that vertices of a class are mostly connected to vertices of the same class. Other mod- 
els look for disassortative mixing in which vertices mostly connect to vertices of different classes. 
They are commonly used to analyze bipartite networks (Estrada and Rodriguez-Velazquez, 2005). 
Finally, a few procedures look for heterogeneous structure where vertices can have different types 
of connection profiles. In particular, they can be used to uncover both community structure and 
disassortative mixing. 

In this section, we describe some of the most widely used graph clustering methods. Note that 
many model-free approaches exist (Fortunato, 2010). However, except for the algorithmic approach 
presented in Section 25.3.1, we concentrate in the following on methods which rely on statistical 
models only. 

25.3.1 Community Structure 

Most graph clustering methods aim at detecting community structure, also called assortative mixing, 
meaning the appearance of densely connected groups of vertices, with only sparser connections 
between groups (Figure 25.4). Most of them rely on the modularity score of Newman and Girvan 
(2004). However, we point out the recent work of Bickel and Chen (2009) who showed that these 
algorithms are (asymptotically) biased and that using modularity scores could lead to the discovery 
of an incorrect community structure, even for large graphs. 
FIGURE 25.4

Example of an undirected affiliation network with 50 vertices. The network is made of three com- 
munities represented in red, blue, and green. Vertices connect mainly to vertices of the same com- 
munity. 


Modularity Score 

Newman and Girvan (Girvan and Newman, 2002; Newman and Girvan, 2004) proposed several 
intuitive community detection algorithms which involve iterative removal of edges from the network 
to split it into communities. Edges to be removed are identified using one of a number of possible 
betweenness measures. All of them are based on the same idea: if two communities are joined by
only a few intercommunity edges, then all paths from vertices in one community to vertices in the
other must pass along one of those few edges. Therefore, given a suitable set of paths, we expect the
number of paths that go along an edge to be the largest for intercommunity edges. First, Newman
and Girvan introduced the edge betweenness, a generalization to edges of the vertex betweenness 
measure of Freeman (1977). The edge betweenness of an edge is defined as the number of shortest 
paths between all pairs of vertices in the network that run along that edge. Second, they considered 
the random walk betweenness. The expected number of times a random walk between a particular 
pair of vertices will pass down a particular edge is calculated. This expected value is then summed 
over all pairs of vertices to obtain the random walk betweenness of the edge. As shown in Newman 
and Girvan (2004), other scores can obviously be considered to obtain algorithms that may be more 
appropriate for some applications. However, it appears that the choice of measure does not highly 
influence the result of the algorithms. On the other hand, the recalculation step after each edge 
removal is crucial (see Algorithm 1). 
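The edge-removal procedure with edge betweenness can be sketched in plain Python as follows. This is an illustrative implementation for unweighted, undirected graphs without self loops, given as a dictionary of neighbor sets; the function names are assumptions, and the betweenness routine follows the shortest-path counting idea of Brandes (2001) rather than any code from the cited papers.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (split evenly among ties)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths and predecessors.
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path counts from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += c
                delta[u] += c
    # Every unordered pair of vertices was visited from both endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in part:
                    part.add(v)
                    queue.append(v)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman(adj):
    """Yield each new division produced by repeated edge removal."""
    adj = {u: set(neighbors) for u, neighbors in adj.items()}
    n_parts = len(components(adj))
    while any(adj.values()):
        scores = edge_betweenness(adj)      # recalculated after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > n_parts:
            n_parts = len(parts)
            yield parts


# Two triangles joined by the single edge 2 - 3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
first_split = next(girvan_newman(two_triangles))
```

On this toy graph all nine shortest paths between the two triangles use the edge 2 - 3, so it is removed first and the first division separates the triangles.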

All these algorithms produce a dendrogram (Figure 25.5) which represents an entirely nested 


repeat
    Calculate betweenness scores for all edges
    Remove the edge with the highest score
until no edges remain;


Algorithm 1: Example of a community structure detection algorithm with a betweenness score. 
hierarchy of possible community divisions for the network. In order to select one of these divisions,
Newman and Girvan (2004) proposed a modularity criterion. Consider a particular division with $Q$
communities and denote $e_{ql}$ as the fraction of all edges in the network that link vertices in community
$q$ to vertices in community $l$. Moreover, consider the fraction $a_q = \sum_{l=1}^{Q} e_{ql}$ of edges that
connect to vertices of community $q$. The modularity criterion is then given by:

$$
Q_{mod} = \sum_{q=1}^{Q} \left( e_{qq} - a_q^2 \right). \tag{25.1}
$$

The criterion is computed for all the divisions, and a division is chosen such that the modularity is
maximized. Note that modularity can be generalized to both directed and valued graphs (Fortunato,
2010).
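The modularity criterion can be evaluated directly from an edge list and a community assignment. The sketch below assumes an undirected graph and follows the usual convention of splitting each edge evenly between $e_{ql}$ and $e_{lq}$ so that all entries sum to 1; the function name and input format are illustrative.

```python
def modularity(edges, community):
    """Modularity sum_q (e_qq - a_q^2) of a division of an undirected graph.

    `edges` is a list of vertex pairs and `community` maps each vertex
    to its community label.
    """
    m = len(edges)
    labels = sorted(set(community.values()))
    e = {q: {l: 0.0 for l in labels} for q in labels}
    for u, v in edges:
        q, l = community[u], community[v]
        # Half of the edge goes to e[q][l] and half to e[l][q].
        e[q][l] += 0.5 / m
        e[l][q] += 0.5 / m
    a = {q: sum(e[q].values()) for q in labels}
    return sum(e[q][q] - a[q] ** 2 for q in labels)


# Two triangles joined by one edge, divided into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_split = modularity(edges, split)
q_single = modularity(edges, {v: "a" for v in range(6)})
```

For this division $e_{qq} = 3/7$ and $a_q = 1/2$ for both communities, giving a modularity of $5/14$, whereas putting all vertices in one community gives 0.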

A limiting factor of these community detection algorithms is their poor scaling with the number
$m$ of edges and the number $N$ of vertices in the network. For instance, calculating the shortest
paths between a particular pair of vertices can be done in $O(m)$ (Ahuja et al., 1993; Cormen et al.,
2001). Because there are $O(N^2)$ vertex pairs, the computational cost to compute all the edge
betweenness scores is in $O(mN^2)$. This complexity was improved independently by Newman (2001)
and Brandes (2001), who find all betweennesses in $O(mN)$. Since this calculation has to be repeated
for the removal of each edge, the entire algorithm runs in worst-case time $O(m^2 N)$. In other words,
for dense networks, where $m$ is in $O(N^2)$, it runs in $O(N^5)$, while it scales in $O(N^3)$ for sparse
networks, where $m$ is linear in $N$.

Rather than building the complete dendrogram (with edge removals) and then choosing the optimal
division using the modularity criterion, Newman (2004) suggested focusing directly on the
optimization of the modularity. Thus, he proposed an algorithm which falls in the general category
of agglomerative hierarchical clustering methods (Everitt, 1974; Scott, 2000). Starting with a configuration
in which each vertex is the sole member of one of $N$ communities, the communities are
iteratively joined together in pairs, choosing at each step the join that results in the greatest increase
(or smallest decrease) in $Q_{mod}$ (25.1). Again, this leads to a dendrogram for which the best cut is
chosen by looking for the maximal value of the modularity. The computational cost of the entire algorithm
is in $O((m + N)N)$, or $O(N^3)$ for dense networks and $O(N^2)$ for sparse networks. It was
shown to be capable of handling a collaboration network with 50,000 vertices in Newman (2004).

Latent Position Cluster Model 

An alternative approach for community detection in networks is the latent position cluster model
(LPCM) of Handcock et al. (2007). Consider an $N \times N$ binary adjacency matrix $X$ such that $X_{ij}$
equals 1 if there is an edge from vertex $i$ to vertex $j$, and 0 otherwise. Moreover, let us define $Y$
as covariate information where $Y_{ij}$ denotes some observed characteristics about the pair $(i, j)$ of
vertices. This might represent, for instance, the traffic information of users from blog $i$ to blog $j$
in a blogosphere network (see Figure 25.3). Several characteristics can possibly be observed for
each pair of vertices and therefore $Y_{ij}$ can be vector valued. Note that a few other random graph
models have been proposed in the literature to take covariates into account (see e.g., Zanghi et al.,
2010; Mariadassou et al., 2010). They will not be considered in this chapter as we consider vertices
clustered by the use of network topology only. Here, we describe LPCM in a general setting, as
in Handcock et al. (2007), and emphasize that the algorithm can also be used if $Y$ is not available
simply by removing the terms in $Y_{ij}$ in the following expressions.

LPCM assumes that the network does not contain any self loop, while both directed and undirected
relations can be analyzed. It is assumed that each vertex, usually called actor in social sciences,
has an unobserved position in a $d$-dimensional Euclidean latent space as in Hoff et al. (2002).
Given the latent positions and the covariate information, the edges are assumed to be drawn from a
Bernoulli distribution:

$$
X_{ij} \mid Z_i, Z_j, Y_{ij} \sim \mathrm{Bern}\big( g(a_{Z_i, Z_j, Y_{ij}}) \big).
$$
FIGURE 25.5

Dendrogram of a network with 50 vertices for the community detection algorithm with edge be- 
tweenness. It should be read from top to bottom. The algorithm starts with a single community 
which contains all the vertices. Edges with the highest edge betweenness are then removed itera- 
tively splitting the network into several communities. After convergence, each vertex, represented 
by a leaf of the tree, is a sole member of one of the 50 communities. 
The function $g(x) = (1 + e^{-x})^{-1}$ is the logistic sigmoid function. Moreover, $a_{Z_i, Z_j, Y_{ij}}$ is given
by:

$$
a_{Z_i, Z_j, Y_{ij}} = Y_{ij}^\top \beta_0 - \beta_1 |Z_i - Z_j|, \tag{25.2}
$$


where $\beta_0$ has the same dimensionality as $Y_{ij}$ and $\beta_1$ is a scalar. Both $\beta_0$ and $\beta_1$ are unknown
parameters to be estimated. To represent clustering, the positions are assumed to be drawn from
a finite mixture of $Q$ multivariate normal distributions, each one representing a different class of
vertices. Each multivariate distribution has its own mean vector as well as spherical covariance
matrix:

$$
Z_i \sim \sum_{q=1}^{Q} \alpha_q \, \mathcal{N}(\mu_q, \sigma_q^2 I),
$$

and $\alpha$ denotes a vector of class proportions which satisfies $\alpha_q > 0, \forall q$ and $\sum_{q=1}^{Q} \alpha_q = 1$. Finally,
according to LPCM, the latent positions $Z_1, \ldots, Z_N$ are i.i.d. and, given this latent structure, all
the edges are supposed to be independent. Consider now the second term on the right-hand side of
(25.2). By construction, if $\beta_1$ is positive, we expect the $L_2$ distance $|Z_i - Z_j|$ to be smaller if vertices
$i$ and $j$ are in the same class. In other words, the probability $g(a_{Z_i, Z_j, Y_{ij}})$ of an edge between $i$
and $j$ is supposed to be higher for vertices sharing the same class. Note that this corresponds exactly
to the definition of a community.
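The generative side of LPCM can be illustrated with a short sketch: latent positions are drawn from a mixture of spherical Gaussians, and the edge probability is the logistic function of (25.2). The function names, the default of no covariates, and the parameter values in the example are assumptions for illustration only; this is not the estimation procedure of Handcock et al. (2007).

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def edge_probability(z_i, z_j, beta1, y_ij=(), beta0=()):
    """g(Y_ij^T beta0 - beta1 * |Z_i - Z_j|) with the Euclidean distance.

    With empty y_ij and beta0 the covariate term is dropped.
    """
    covariate_term = sum(y * b for y, b in zip(y_ij, beta0))
    return sigmoid(covariate_term - beta1 * math.dist(z_i, z_j))


def sample_positions(n, alpha, means, sigmas, rng):
    """Draw n class labels and positions from a mixture of spherical Gaussians."""
    classes = rng.choices(range(len(alpha)), weights=alpha, k=n)
    positions = [tuple(rng.gauss(mu, sigmas[q]) for mu in means[q])
                 for q in classes]
    return classes, positions


rng = random.Random(0)
classes, positions = sample_positions(
    n=10, alpha=[0.5, 0.5], means=[(0.0, 0.0), (5.0, 5.0)],
    sigmas=[0.5, 0.5], rng=rng)
p_close = edge_probability((0.0, 0.0), (0.0, 1.0), beta1=1.0)
p_far = edge_probability((0.0, 0.0), (3.0, 4.0), beta1=1.0)
```

With a positive `beta1`, pairs of vertices that are close in the latent space, typically those drawn from the same mixture component, receive a higher edge probability than distant pairs.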

Handcock et al. (2007) proposed a two-stage maximum likelihood approach and a Bayesian 
algorithm, as well as a BIC criterion to estimate the number of latent classes. The two-stage maxi- 
mum likelihood approach first maps the vertices in the latent space and then uses a mixture model 
to cluster the resulting positions. In practice, this procedure converges more quickly but loses some 
information by not estimating the positions and the cluster model at the same time. Conversely, 
the Bayesian algorithm (see Figure 25.6), based on Markov chain Monte Carlo, estimates both the 
latent positions and the mixture model parameters simultaneously. It gives better results but is
time-consuming. Both the maximum likelihood and the Bayesian approach are limited in the sense that
they can handle networks with a few hundred vertices only.


25.3.2 Heterogeneous Structure 

So far, we have seen some algorithms to uncover communities. However, some vertices may form a
group while exhibiting connection patterns that differ from those of a dense group poorly linked to the rest
of the network. In genetic regulatory networks, transcription factors co-regulating some biological
process may, for example, not be linked to one another but act jointly on the regulated genes. Some 
other approaches which can look for heterogeneous structure in networks, where vertices can have 
different types of connection profiles, have therefore been developed. 

Hofman and Wiggins’ Model 

Let us consider a binary adjacency matrix $X$ representing a network $\mathcal{G}$. The model of Hofman and
Wiggins (Hofman and Wiggins, 2008) associates to each vertex of the network a latent variable $Z_i$
drawn from a multinomial distribution:

$$
Z_i \sim \mathrm{Multinom}\big(1, \alpha = (\alpha_1, \ldots, \alpha_Q)\big). \tag{25.3}
$$

As in other standard mixture models, the vector $Z_i$ has all its components set to zero except one,
such that $Z_{iq}$ equals 1 if vertex $i$ belongs to class $q$. Thus, $\sum_{q=1}^{Q} Z_{iq} = 1, \forall i$, and the vector $\alpha$
satisfies $\alpha_q > 0, \forall q$, as well as $\sum_{q=1}^{Q} \alpha_q = 1$. The edges are then assumed to be drawn from a
Bernoulli distribution:

$$
X_{ij} \sim \mathrm{Bern}(\lambda)
$$
FIGURE 25.6

Directed network of social relations between 18 monks in an isolated American monastery (Sampson,
1969; White et al., 1976). Sampson collected sociometric information using interviews, experiments,
and observations. This network focused on the relation of "liking." A monk is said to
have a social relation of "like" to another monk if he ranked that monk in the top three monks for
positive affection in any of the three interviews given. The positions of the vertices in the
two-dimensional latent space have been calculated using the Bayesian approach for LPCM. The positions
of the three class centers found are indicated, as well as circles with radius equal to the square root
of the estimated class variances.
FIGURE 25.7

Example of an undirected network with 20 vertices. The connection probabilities between the two 
classes in red and green are higher than the intra class probabilities. Vertices connect mainly to 
vertices of a different class. 


if vertices $i$ and $j$ are in the same class, i.e., $Z_i = Z_j$, and

$$X_{ij} \sim \mathrm{Bern}(\epsilon)$$

otherwise. Thus, the model is able to take into account both community structure ($\lambda > \epsilon$)
(Figure 25.4) and disassortative mixing ($\lambda < \epsilon$) (Figure 25.7). As in the previous section, given the
latent variables $Z_1, \dots, Z_N$, all the edges are assumed to be independent. In order to estimate the
posterior distribution $p(Z, \alpha, \lambda, \epsilon \mid X)$ over the latent variables and model parameters, Hofman and
Wiggins (2008) used a variational Bayes expectation maximization (EM) algorithm with a factorized
distribution.

Moreover, they proposed a model selection criterion to estimate the number of latent classes in
networks. It relies on a variational approximation of the marginal log-likelihood $\log p(X)$ and has
shown promising results.

Stochastic Blockmodels 

Originally developed in the social sciences, the stochastic blockmodel (SBM) is a probabilistic
generalization (Fienberg and Wasserman, 1981; Holland et al., 1983) of the method described in White et al.
(1976). Given a network, it assumes that each vertex belongs to a hidden class among $Q$ classes and
uses a matrix $\pi$ to describe the intra and inter connection probabilities (Frank and Harary, 1982).
No assumption is made on the form of the connectivity matrix, so that very different structures
can be taken into account. In particular, SBM can characterize the presence of hubs which make
networks locally dense (Daudin et al., 2008). Moreover, and to some extent, it generalizes many of
the existing graph clustering techniques, as shown in Newman and Leicht (2007). For instance, the
model of Hofman and Wiggins can be seen as a constrained SBM where the diagonal of $\pi$ is set to
$\lambda$ and all the other elements to $\epsilon$.

Formally, SBM considers a latent variable $Z_i$, drawn from a multinomial distribution (25.3), for
each vertex in the network, as in Section 25.3.2. Thus, each vertex belongs to a single class, and that
class is $q$ if $Z_{iq}$ equals 1. The edges are then assumed to be drawn from a Bernoulli distribution:

$$X_{ij} \mid \{Z_{iq} Z_{jl} = 1\} \sim \mathrm{Bern}(\pi_{ql}),$$

where $\pi$ is a $Q \times Q$ matrix of connection probabilities. Again, given all the latent variables, the
edges are assumed to be independent. Note that SBM was originally described in a more general
setting (Nowicki and Snijders, 2001), allowing any discrete relational data. However, as explained
in Section 25.2.1, we concentrate on binary edges only.
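
The generative process above can be illustrated with a minimal sketch. It assumes a directed graph without self-loops and uses only the Python standard library; the function names `sample_sbm` and `hofman_wiggins_pi` are hypothetical and introduced here for illustration only.

```python
import random


def sample_sbm(n, alpha, pi, seed=0):
    """Sample a directed binary SBM without self-loops (illustrative sketch).

    n     : number of vertices
    alpha : class proportions (length Q, summing to 1)
    pi    : Q x Q list of lists of connection probabilities
    Returns (labels, X): labels[i] is the class of vertex i,
    X is the n x n adjacency matrix as a list of lists.
    """
    rng = random.Random(seed)
    n_classes = len(alpha)
    # Z_i ~ Multinom(1, alpha): one class label per vertex.
    labels = rng.choices(range(n_classes), weights=alpha, k=n)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # X_ij | Z_iq Z_jl = 1 ~ Bern(pi_ql)
                X[i][j] = int(rng.random() < pi[labels[i]][labels[j]])
    return labels, X


def hofman_wiggins_pi(n_classes, lam, eps):
    """Constrained connectivity matrix: lam on the diagonal, eps elsewhere."""
    return [[lam if q == l else eps for l in range(n_classes)]
            for q in range(n_classes)]


# Disassortative example (lam < eps), in the spirit of Figure 25.7.
labels, X = sample_sbm(20, [0.5, 0.5], hofman_wiggins_pi(2, 0.1, 0.8))
```

Passing an unconstrained matrix `pi` gives the general SBM, while `hofman_wiggins_pi` recovers the two-parameter special case discussed above.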

The identifiability of the parameters in SBM was studied by Allman et al. (2009; 2011), who
showed that the model is generically identifiable up to a permutation of the classes. In other words,
except on a set of parameters of Lebesgue measure zero, two parameters imply the same
random graph model if and only if they differ only by the ordering of the classes.

Many methods have been proposed in the literature to jointly estimate SBM model parameters
and cluster the vertices of the network. They all face the same difficulty. Indeed, contrary to many
mixture models, the conditional distribution of all the latent variables $Z$ and model parameters,
given the observed data $X$, cannot be factorized due to conditional dependency. Therefore,
optimization techniques such as the expectation maximization (EM) algorithm cannot be used directly.
In the case of SBM, Nowicki and Snijders (2001) proposed a Bayesian probabilistic approach. They
introduced Dirichlet prior distributions for the model parameters and used Gibbs sampling
to approximate the posterior distribution over the model parameters and the posterior predictive
distribution. Their algorithm is implemented in the software BLOCKS, which is part of the package
StOCNET (Boer et al., 2006). It gives accurate a posteriori estimates but cannot handle networks
with more than 200 vertices. Daudin et al. (2008) proposed a frequentist variational EM approach for
SBM which can handle much larger networks and developed an integrated classification likelihood
(ICL) criterion for model selection. Latouche et al. (2011) adapted it to a Bayesian framework,
yielding an algorithm which better retrieves small classes and performs model selection with a
non-asymptotic criterion. Online strategies have also been developed (Zanghi et al., 2008), as well as
extensions to deal with discrete or continuous edges (Mariadassou et al., 2010).


25.4 Overlapping Clustering 

As mentioned previously, most graph clustering methods are restricted by the requirement
that each vertex belong to exactly one class. We present in this section some algorithmic
and statistical adaptations of the existing clustering methods which tackle this issue. We focus here
on methods that assign to each vertex a vector in $\{0, 1\}^Q$, where $Q$ denotes the number of
classes. In other words, each individual belongs completely to all groups it participates in. Methods
using vectors of coefficients summing to 1 and giving the relative importance of each class in the
individual behavior have also been developed (Blei et al., 2003; Airoldi et al., 2008).

25.4.1 Algorithmic Approaches 

The issue of overlapping clustering has received growing attention in the last few years, starting
with an algorithmic approach based on clique percolation developed by Palla et al. (2005) and
implemented in the software CFinder (Palla et al., 2006). In this approach, a $k$-clique community
is defined as the union of all $k$-cliques (complete sub-graphs of size $k$) that can be reached from
each other through a series of adjacent $k$-cliques.¹ Given a network, the algorithm first locates all
cliques and then identifies the communities using a clique-clique overlap matrix (Everett and
Borgatti, 1998). By construction, the resulting communities can overlap. In order to select the optimal
value of $k$, Palla et al. (2005) suggested a global criterion which looks for a community structure as
highly connected as possible. Small values of $k$ lead to a giant community which smears the details
of a network by merging small communities. Conversely, when $k$ increases, the communities tend
to become smaller, more disintegrated, but also more cohesive. Therefore, they proposed a heuristic
which consists of running their algorithm for various values of $k$ and then selecting the lowest value
such that no giant community appears.

¹ Two $k$-cliques are adjacent if they share $k - 1$ vertices.
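
The definition of a $k$-clique community can be made concrete with a brute-force sketch. It is not the CFinder implementation: it enumerates every $k$-subset of vertices, so it is only suitable for very small graphs, and the function name `clique_percolation` is a hypothetical one chosen for this example.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Return the k-clique communities of an undirected graph (brute force)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    # Enumerate every k-clique, i.e. every complete sub-graph on k vertices.
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: two cliques are adjacent if they share k - 1 vertices.
    parent = list(range(len(cliques)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) == k - 1:
            parent[find(a)] = find(b)
    # A community is the union of the vertices of all cliques in one component.
    communities = {}
    for idx, clique in enumerate(cliques):
        communities.setdefault(find(idx), set()).update(clique)
    return sorted(communities.values(), key=sorted)


# Triangles {1,2,3} and {2,3,4} are adjacent; triangle {4,5,6} is not.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = clique_percolation(edges, 3)
```

In this toy graph, vertex 4 ends up in both communities, which illustrates how overlap arises by construction.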

Shen et al. (2009) adapted the classification method of Girvan and Newman to overlapping 
clusters in a method called EAGLE. To do so, they first built a bottom-up dendrogram starting with 
some well-chosen and possibly overlapping maximal cliques. At each step, a distance was computed 
for every pair of communities based on the proportion of edges linking those communities. The 
two nearest ones were then merged. The cut level of the dendrogram was chosen according to a 
generalization of the modularity to overlapping communities, namely: 

$$Q_{ov} = \frac{1}{2m} \sum_{q} \sum_{i,j} \frac{1}{O_i O_j} \left( X_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j),$$

where $O_i$ is equal to the number of communities $i$ belongs to. It can be shown that if all the $O_i$ are
equal to 1, this expression is equal to the modularity defined in Equation (25.1). The contribution of
each edge then decreases when its incident vertices belong to several communities.

However, those algorithmic procedures are limited to the detection of communities. Statistical 
tools are then needed to find overlapping heterogeneous structures. 

25.4.2 Overlapping Stochastic Blockmodel 

Let us now investigate the adaptation of the stochastic blockmodel to overlapping classes. The
hidden structure can no longer be a mixture model, so the constraints $\sum_{q} Z_{iq} = 1$ and $\sum_{q} \alpha_q = 1$
present in SBM are relaxed. Thus, a new latent vector $Z_i$ is introduced for each vertex $i$ of the
network. This vector is composed of $Q$ independent Boolean variables $Z_{iq} \in \{0, 1\}$ drawn from
a multivariate Bernoulli distribution:

$$Z_i \sim \prod_{q=1}^{Q} \mathrm{Bern}(Z_{iq}; \alpha_q) = \prod_{q=1}^{Q} \alpha_q^{Z_{iq}} (1 - \alpha_q)^{1 - Z_{iq}}. \tag{25.4}$$

We point out that $Z_i$ can also have all its components set to zero, which is a useful feature in practice
as we shall see in Section 25.4.2. The edge probabilities are then given by:

$$X_{ij} \mid Z_i, Z_j \sim \mathrm{Bern}\big(X_{ij}; g(a_{Z_i,Z_j})\big) = e^{X_{ij} a_{Z_i,Z_j}} g(-a_{Z_i,Z_j}),$$

where

$$a_{Z_i,Z_j} = Z_i^\top W Z_j + Z_i^\top U + V^\top Z_j + W^*, \tag{25.5}$$

and $g(x) = (1 + e^{-x})^{-1}$ is the logistic sigmoid function. $W$ is a $Q \times Q$ real matrix, whereas $U$
and $V$ are $Q$-dimensional real vectors. The first term in the right-hand side of (25.5) describes the
interactions between the vertices $i$ and $j$. If $i$ belongs only to class $q$ and $j$ only to class $l$, then
only one interaction term remains ($Z_i^\top W Z_j = W_{ql}$). However, the model can take more complex
interactions into account if one or both of these two vertices belong to multiple classes (Figure
25.8). Note that the second term in (25.5) does not depend on $Z_j$. It models the overall capacity
of vertex $i$ to connect to other vertices. By symmetry, the third term represents the global tendency
of vertex $j$ to receive an edge. These two parameters $U$ and $V$ are related to the sender/receiver
effects $\delta_i$ and $\gamma_j$ in the latent cluster random effects model (LCREM) of Krivitsky et al. (2009).
However, contrary to LCREM, $\delta_i = Z_i^\top U$ and $\gamma_j = V^\top Z_j$ depend on the classes. In other words,
two different vertices sharing the same classes will have exactly the same sender/receiver effects,
which is not the case in LCREM. Finally, we use the scalar $W^*$ as a bias, to model sparsity.

If we associate to each latent variable $Z_i$ a vector $\tilde{Z}_i = (Z_i, 1)^\top$, then (25.5) can be written:

$$a_{Z_i,Z_j} = \tilde{Z}_i^\top \tilde{W} \tilde{Z}_j, \tag{25.6}$$
FIGURE 25.8

Example of a directed graph with three overlapping clusters. 


where

$$\tilde{W} = \begin{pmatrix} W & U \\ V^\top & W^* \end{pmatrix}.$$


The $\tilde{Z}_{i(Q+1)}$s can be seen as random variables drawn from a Bernoulli distribution with
probability $\alpha_{Q+1} = 1$. Thus, one way to think about the model is to consider that all the vertices in the
graph belong to a $(Q + 1)$-th cluster which is overlapped by all the other clusters. In the following,
we will use (25.6) to simplify the notation.
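
The equivalence between the four-term form (25.5) and the augmented form (25.6) can be checked numerically. The sketch below uses plain Python lists; the function names and the numerical values of $W$, $U$, $V$, and $W^*$ are arbitrary choices made for illustration, not values from the chapter.

```python
from itertools import product


def linear_predictor(zi, zj, W, U, V, w_star):
    """a = Z_i^T W Z_j + Z_i^T U + V^T Z_j + W*, as in (25.5)."""
    q = len(zi)
    interaction = sum(zi[a] * W[a][b] * zj[b] for a in range(q) for b in range(q))
    sender = sum(zi[a] * U[a] for a in range(q))
    receiver = sum(V[b] * zj[b] for b in range(q))
    return interaction + sender + receiver + w_star


def augmented_predictor(zi, zj, W, U, V, w_star):
    """Same quantity through the augmented vectors and matrix of (25.6)."""
    q = len(zi)
    # W_tilde has rows (W | U) followed by the row (V^T | W*).
    W_tilde = [W[a] + [U[a]] for a in range(q)] + [V + [w_star]]
    zi_t, zj_t = zi + [1], zj + [1]
    return sum(zi_t[a] * W_tilde[a][b] * zj_t[b]
               for a in range(q + 1) for b in range(q + 1))


# Arbitrary illustrative parameters with Q = 2.
W = [[2.0, -1.0], [0.5, 1.5]]
U = [0.3, -0.2]
V = [-0.4, 0.1]
w_star = -3.0

# Compare both forms over all 4 x 4 membership combinations.
max_gap = max(
    abs(linear_predictor(list(zi), list(zj), W, U, V, w_star)
        - augmented_predictor(list(zi), list(zj), W, U, V, w_star))
    for zi in product([0, 1], repeat=2) for zj in product([0, 1], repeat=2)
)
```

Here `max_gap` should be zero up to floating-point rounding, including for the null vector $Z_i = 0$, where only $V^\top Z_j + W^*$ remains.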

Finally, given the latent structure $Z = \{Z_1, \dots, Z_N\}$, all the edges are assumed to be
independent. Thus, when considering directed graphs without self-loops, the overlapping stochastic
blockmodel (OSBM) is defined through the following distributions:

$$p(Z \mid \alpha) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \alpha_q^{Z_{iq}} (1 - \alpha_q)^{1 - Z_{iq}}, \tag{25.7}$$

$$p(X \mid Z, \tilde{W}) = \prod_{i \neq j} e^{X_{ij} a_{Z_i,Z_j}} g(-a_{Z_i,Z_j}).$$

The graphical model of OSBM is given in Figure 25.9.
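
As an illustration of this generative definition, the following sketch draws a small directed graph from an OSBM. It relies only on the Python standard library; the function name `sample_osbm` and the parameter values are assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_osbm(n, alpha, W_tilde, seed=0):
    """Sample (Z, X) from an OSBM for a directed graph without self-loops.

    alpha   : list of Q class probabilities (they need not sum to 1)
    W_tilde : (Q+1) x (Q+1) augmented matrix [[W, U], [V^T, W*]]
    """
    rng = random.Random(seed)
    q = len(alpha)
    # Each Z_iq is an independent Bernoulli(alpha_q) draw, as in (25.7).
    Z = [[int(rng.random() < alpha[k]) for k in range(q)] for _ in range(n)]
    Z_tilde = [z + [1] for z in Z]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # a = Z~_i^T W~ Z~_j, as in (25.6); edge probability is g(a).
            a = sum(Z_tilde[i][r] * W_tilde[r][c] * Z_tilde[j][c]
                    for r in range(q + 1) for c in range(q + 1))
            X[i][j] = int(rng.random() < sigmoid(a))
    return Z, X


# Illustrative parameters: strong intra-class terms and a negative bias W*.
W_tilde = [[4.0, -1.0, 0.0],
           [-1.0, 4.0, 0.0],
           [0.0, 0.0, -3.0]]
Z, X = sample_osbm(15, [0.4, 0.4], W_tilde)
```

Vertices whose row of `Z` is all zeros belong to no class and connect only through the bias term, while vertices with several ones accumulate interaction terms from every class they belong to.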
FIGURE 25.9

Directed acyclic graph representing the overlapping stochastic blockmodel. Nodes represent random 
variables, which are shaded when they are observed, and edges represent conditional dependencies. 


Modeling Sparsity 

As mentioned in Section 25.2, real networks are often sparse and it is crucial to distinguish the two sources
of non-interaction. Sparsity might be the result of the rarity of interactions in general, but it might
also indicate that some class (intra or inter) connection probabilities are close to zero. For instance,
social networks are often made of communities where vertices are mostly connected to vertices
of the same community. This corresponds to classes with high intra connection probabilities and
low inter connection probabilities. In (25.5), we notice that $W^*$ appears in $a_{Z_i,Z_j}$ for every pair
of vertices. Therefore, $W^*$ is a convenient parameter to model the two sources of sparsity. Indeed,
low values of $W^*$ result from the rarity of interactions in general, whereas high values signify that
sparsity comes from the classes (parameters in $W$, $U$, and $V$).

Modeling Outliers 

When applied to real networks, graph clustering methods often lead to giant classes of vertices
having low output and input degrees (Daudin et al., 2008; Latouche et al., 2010). These classes are
usually discarded and the analysis of networks focuses on more highly structured classes to extract
useful information. The product of Bernoulli distributions (25.7) provides a natural way to encode
these “outliers.” Indeed, rather than using giant classes, OSBM uses the null component such that
$Z_i = 0$ if vertex $i$ is an outlier and should not be classified in any class.

Identifiability 

As in the case of the SBM, reordering the $Q$ classes of the OSBM and making the corresponding
modification in $\alpha$ and $\tilde{W}$ does not change the generative random graph model.

There is another family of operations which does not change the generative random graph model,
which we call inversions. They consist of fixing a subset $S \subseteq \{1, \dots, Q\}$ and exchanging the labels
0 and 1 on the coordinates of the $Z_i$s included in $S$. To give an intuition, let us consider
the inversion with $S = \{1\}$. If we denote by “cluster 1” the vertices whose $Z_i$s have a 1 as the first
coordinate, the initial graph sampling procedure consists of sampling the set “cluster 1” and then
drawing the edges conditionally on that information. After the inversion, it samples the vertices
which are not in “cluster 1” and draws the edges conditionally on that information, which is an
equivalent procedure.
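
One way to make the inversion with $S = \{1\}$ explicit is to write it in the augmented notation of (25.6). The derivation below is a sketch; the matrix $A$ and the primed symbols are notation introduced here for illustration and do not appear in the original text.

```latex
% Inversion of coordinate 1: Z'_{i1} = 1 - Z_{i1}, other coordinates unchanged.
% Let A be the (Q+1) x (Q+1) identity matrix, except A_{1,1} = -1 and A_{1,Q+1} = 1.
\tilde{Z}'_i = A \tilde{Z}_i, \qquad A^2 = I
  \;\Longrightarrow\; \tilde{Z}_i = A \tilde{Z}'_i .

% Substituting into (25.6):
a_{Z_i,Z_j} = \tilde{Z}_i^\top \tilde{W} \tilde{Z}_j
            = \tilde{Z}_i'^{\top} \big( A^\top \tilde{W} A \big) \tilde{Z}'_j .

% Since Z'_{i1} ~ Bern(1 - \alpha_1), the transformed parameters
\alpha'_1 = 1 - \alpha_1, \qquad \alpha'_q = \alpha_q \ (q \geq 2), \qquad
\tilde{W}' = A^\top \tilde{W} A
% define the same distribution over the adjacency matrix X.
```

Under these assumptions, the two parameter sets $(\alpha, \tilde{W})$ and $(\alpha', \tilde{W}')$ cannot be distinguished from the observed graph, which is why identifiability can only hold up to such inversions.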

As shown in Latouche et al. (2011), the OSBM is generically identifiable up to permutations of
the classes and inversions. In other words, except on a set of parameters of Lebesgue measure
zero, two parameters imply the same random graph model if and only if the second can be
obtained from the first by a permutation and an inversion.

Parameter Estimation 

The log-likelihood of the observed dataset is defined through the marginalization $p(X \mid \alpha, \tilde{W}) =
\sum_{Z} p(X, Z \mid \alpha, \tilde{W})$. This summation involves $2^{NQ}$ terms and quickly becomes intractable. To
tackle this issue, the EM algorithm has been applied to many mixture models. However, the E-step
requires the calculation of the posterior distribution $p(Z \mid X, \alpha, \tilde{W})$, which cannot be factorized
in the case of networks (Daudin et al., 2008). In order to obtain a tractable procedure, some
approximations based on global and local variational techniques have to be made.

The global variational technique consists of considering, for any distribution $q(Z)$, the
decomposition

$$\log p(X \mid \alpha, \tilde{W}) = \mathcal{L}_{ML}(q; \alpha, \tilde{W}) + \mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X, \alpha, \tilde{W})\big), \tag{25.8}$$

where

$$\mathcal{L}_{ML}(q; \alpha, \tilde{W}) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \alpha, \tilde{W})}{q(Z)}, \tag{25.9}$$

and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. The maximum $\log p(X \mid \alpha, \tilde{W})$ of the lower
bound $\mathcal{L}_{ML}$ (25.9) is reached when $q(Z) = p(Z \mid X, \alpha, \tilde{W})$. Thus, if the posterior distribution
$p(Z \mid X, \alpha, \tilde{W})$ were tractable, the optimizations of $\mathcal{L}_{ML}$ and $\log p(X \mid \alpha, \tilde{W})$, with respect to $\alpha$ and
$\tilde{W}$, would be equivalent. However, in the case of networks, $p(Z \mid X, \alpha, \tilde{W})$ cannot be calculated,
and $\mathcal{L}_{ML}$ cannot be optimized over the entire space of $q(Z)$ distributions. Thus, the optimization is
restricted to the class of distributions which satisfy:

$$q(Z) = \prod_{i=1}^{N} q(Z_i), \tag{25.10}$$

with

$$q(Z_i) = \prod_{q=1}^{Q} \mathrm{Bern}(Z_{iq}; \tau_{iq}) = \prod_{q=1}^{Q} \tau_{iq}^{Z_{iq}} (1 - \tau_{iq})^{1 - Z_{iq}}.$$

Each $\tau_{iq}$ is a variational parameter which corresponds to the posterior probability of node $i$
belonging to class $q$.
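
For a very small network, the decomposition (25.8) can be verified by brute force, since all $2^{NQ}$ configurations of $Z$ can be enumerated. The sketch below does this for $N = 3$ and $Q = 2$ with a factorized $q(Z)$ of the form (25.10); every numerical value is an arbitrary choice made for illustration.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_joint(X, Z, alpha, W_tilde):
    """log p(X, Z | alpha, W~) for a directed OSBM without self-loops."""
    n, q = len(Z), len(alpha)
    lp = 0.0
    for i in range(n):
        for k in range(q):
            lp += math.log(alpha[k] if Z[i][k] else 1.0 - alpha[k])
    Z_tilde = [list(z) + [1] for z in Z]
    for i in range(n):
        for j in range(n):
            if i != j:
                a = sum(Z_tilde[i][r] * W_tilde[r][c] * Z_tilde[j][c]
                        for r in range(q + 1) for c in range(q + 1))
                # log Bern(X_ij; g(a)) = X_ij * a + log g(-a)
                lp += X[i][j] * a + math.log(sigmoid(-a))
    return lp


def log_q(Z, tau):
    """log q(Z) for the factorized Bernoulli family (25.10)."""
    return sum(math.log(tau[i][k] if Z[i][k] else 1.0 - tau[i][k])
               for i in range(len(Z)) for k in range(len(Z[0])))


n, q = 3, 2
alpha = [0.4, 0.7]
W_tilde = [[2.0, -1.0, 0.3], [0.5, 1.5, -0.2], [-0.4, 0.1, -1.0]]
X = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
tau = [[0.6, 0.2], [0.5, 0.9], [0.3, 0.4]]

# All 2^(N*Q) = 64 latent configurations.
configs = [[bits[i * q:(i + 1) * q] for i in range(n)]
           for bits in itertools.product([0, 1], repeat=n * q)]

log_px = math.log(sum(math.exp(log_joint(X, Z, alpha, W_tilde)) for Z in configs))
lower_bound = sum(math.exp(log_q(Z, tau))
                  * (log_joint(X, Z, alpha, W_tilde) - log_q(Z, tau))
                  for Z in configs)
kl = sum(math.exp(log_q(Z, tau))
         * (log_q(Z, tau) - (log_joint(X, Z, alpha, W_tilde) - log_px))
         for Z in configs)
```

Up to rounding, `log_px` equals `lower_bound + kl`, and `kl` is non-negative, so `lower_bound` never exceeds the log-likelihood. The enumeration already has 64 terms here, which hints at why it is infeasible for realistic $N$.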

This global variational approximation is sufficient to obtain a tractable problem in the case of
SBM. Unfortunately, in the case of OSBM, a term $E_{Z_i,Z_j}[\log g(-a_{Z_i,Z_j})]$ appears when writing
down the complete formula of $\mathcal{L}_{ML}(q)$. Since the logistic sigmoid function is nonlinear, it cannot
be computed analytically. Thus, we need a second level of approximation to optimize the lower
bound of the observed dataset. It consists of again considering a lower bound and new parameters
such that the bound is tight for the optimal values of the parameters.
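More precisely, given a variational parameter $\xi_{ij}$, $E_{Z_i,Z_j}[\log g(-a_{Z_i,Z_j})]$ satisfies:

$$E_{Z_i,Z_j}[\log g(-a_{Z_i,Z_j})] \geq \log g(\xi_{ij}) - \frac{\tilde{\tau}_i^\top \tilde{W} \tilde{\tau}_j + \xi_{ij}}{2} - \lambda(\xi_{ij}) \Big( E_{Z_i,Z_j}\big[(\tilde{Z}_i^\top \tilde{W} \tilde{Z}_j)^2\big] - \xi_{ij}^2 \Big). \tag{25.11}$$

Eventually, it leads to the two-step approximation:

$$\log p(X \mid \alpha, \tilde{W}) \geq \mathcal{L}_{ML}(q; \alpha, \tilde{W}) \geq \mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi). \tag{25.12}$$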
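
The local bound can be checked pointwise, before any expectation is taken. The sketch below assumes the standard form of this quadratic bound on the logistic function, with $\lambda(\xi) = (g(\xi) - 1/2) / (2\xi)$; this definition of $\lambda$ is not given in the text above and is an assumption taken from the usual statement of the bound. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lam(xi):
    """lambda(xi) = (g(xi) - 1/2) / (2 xi); assumed definition, requires xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)


def local_lower_bound(a, xi):
    """Quadratic lower bound on log g(-a) with variational parameter xi."""
    return math.log(sigmoid(xi)) - (a + xi) / 2.0 - lam(xi) * (a * a - xi * xi)


def exact(a):
    return math.log(sigmoid(-a))


# Gap between the exact value and the bound on a small grid of (a, xi) values.
gaps = {(a, xi): exact(a) - local_lower_bound(a, xi)
        for a in (-3.0, -0.5, 0.7, 2.5) for xi in (0.1, 1.0, 4.0)}
```

All gaps should be non-negative, and the gap vanishes when $\xi = \pm a$, which is the sense in which the bound is tight for the optimal value of the variational parameter.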

The developed expression of $\mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi)$ is then tractable. It can be found in Latouche
et al. (2011). The resulting variational EM algorithm (see Algorithm 2) alternately computes the
parameters $\xi_{ij}$, the posterior probabilities $\tau_i$, and the parameters $\alpha$ and $\tilde{W}$ maximizing
$\mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi)$.

Such a procedure is related to the work of Ghahramani and Jordan (1997) and their use of
variational approximations to perform inference in factorial hidden Markov models.

// INITIALIZATION
Initialize $\tau$ with an ascendant hierarchical classification algorithm;
Sample $\tilde{W}$ from a zero-mean $\sigma^2$ spherical Gaussian distribution;

// OPTIMIZATION
repeat
    // $\xi$-transformation
    $\xi_{ij} \leftarrow \sqrt{\mathrm{Tr}\big(\tilde{W}^\top \tilde{E}_i \tilde{W} \tilde{\Sigma}_j\big) + \tilde{\tau}_j^\top \tilde{W}^\top \tilde{E}_i \tilde{W} \tilde{\tau}_j}$, $\forall i \neq j$;

    // M-step
    $\alpha_q \leftarrow \frac{1}{N} \sum_{i=1}^{N} \tau_{iq}$, $\forall q$;
    Optimize $\mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi)$ with respect to $\tilde{W}$, with a gradient-based optimization algorithm
    (e.g., quasi-Newton method of Broyden et al., 1970);

    // E-step
    repeat
        for $i = 1:N$ do
            Optimize $\mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi)$ with respect to $\tau_i$, with a box constrained ($\tau_{iq} \in [0, 1]$)
            gradient-based optimization algorithm (e.g., Byrd method, Byrd et al., 1995);
        end
    until $\tau$ converges;
until $\mathcal{L}_{ML}(q; \alpha, \tilde{W}, \xi)$ converges.

Algorithm 2: Overlapping stochastic blockmodel for directed graphs without self-loops.
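
The closed-form update of $\alpha$ in the M-step can be recovered with a short calculation. The sketch below assumes that, under the factorized distribution (25.10), the only part of the lower bound that depends on $\alpha$ is the expected log-prior of $Z$ from (25.7); the symbol $F(\alpha)$ is introduced here for illustration.

```latex
% Part of the lower bound depending on \alpha (expectation of log p(Z | \alpha) under q):
F(\alpha) = \sum_{i=1}^{N} \sum_{q=1}^{Q}
  \Big[ \tau_{iq} \log \alpha_q + (1 - \tau_{iq}) \log (1 - \alpha_q) \Big].

% Setting the derivative with respect to \alpha_q to zero:
\frac{\partial F}{\partial \alpha_q}
  = \sum_{i=1}^{N} \left( \frac{\tau_{iq}}{\alpha_q} - \frac{1 - \tau_{iq}}{1 - \alpha_q} \right) = 0
\quad \Longrightarrow \quad
\alpha_q = \frac{1}{N} \sum_{i=1}^{N} \tau_{iq}.
```

Because the classes are no longer constrained to sum to one, each $\alpha_q$ is updated independently, without the Lagrange multiplier needed for the mixture proportions of SBM.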

The computational cost of the algorithm is $O(N^2 Q^4)$. For comparison, the computational
cost of the methods proposed by Daudin et al. (2008) and Latouche et al. (2010) for
(non-overlapping) SBM is $O(N^2 Q^2)$. Analyzing a sparse network with 100 nodes takes about
ten seconds on a dual core, and about a minute for dense networks.
25.5 Discussion

Clustering aims at summarizing the information of a dataset. When considering graphs, a 
widespread way of summarizing the set of vertices and edges consists of forming groups of ver- 
tices exhibiting similar connectivity patterns. In this chapter, we reviewed different models and 
algorithms dealing with this kind of clustering problem. The review went from simple to more com- 
plex approaches. Each presented approach assumed a particular structure. Although the overlapping 
structure generalizes simple community models or stochastic blockmodels, it does not mean that 
it should always be the default choice. Indeed the overlapping stochastic blockmodel has many 
parameters and may not be as stable as simpler models. Following Occam’s razor,
preferring the simpler model often leads to sounder solutions. This basic statistical remark leads us
to state that model choice strategy is an important research topic, worth exploring for practical
graph clustering applications.

Choosing between models for clustering is usually performed using two kinds of strategies.
The first strategy tests different types of models and chooses the model maximizing a model choice
criterion (e.g., the Bayesian information criterion) (Kemp and Tenenbaum, 2008). The second strategy
explores the model space while estimating the parameters. Developing such strategies for choosing
not only the number of clusters but also the type of model (SBM, OSBM, . . . ) would be of
interest for the rapidly developing field of graph clustering.